Introduction

The following document provides an overview of the TT-MLIR project, with a focus on the technical specifications of an MLIR-based compiler stack. So what exactly is an MLIR-based compiler stack? MLIR (Multi-Level Intermediate Representation) is a subproject coming out of the LLVM Project. It seeks to bring extensibility and sustainable code design to a highly modular compiler framework. In essence, it takes a much larger, more involved compiler (like LLVM) and splits it into sub-compilers that each produce their own Intermediate Representation (IR) of what you've fed the compiler.

Disclaimer: This is intended to be a working document; if you find something incorrect or incomplete, please feel free to create a PR.

Motivations

The idea of having a multi-level IR might not seem so far-fetched; in fact, it resembles some of our current software stacks. The idea of going from a high-level TVM Graph → lowered PyBUDA Graph → Netlist, with each layer having its own level of optimizations, is quite a familiar concept. However, there are problems with the reusability and integration of optimizations in the current software compiler stack. Currently, users are almost forced to choose between top-down and bottom-up optimization, with both requiring expert-level knowledge to reach the desired performance. Developing two entirely different projects is taxing, and it's hard to translate the benefits of BUDA over to metal (or the other way around). One of the primary goals of tt-mlir is to enable a consistent programming model between software stacks: concepts for improving optimizations in the compiler stack should carry over 1:1 to hand-written TTNN.

The benefits grow even further once one understands all the possible entry points that multiple IRs present. Existing MLIR-based projects like OpenXLA and torch-mlir can natively output MLIR in dialects that can be translated into the TTIR dialect as well!

What is MLIR and why use it?

MLIR is a compiler infrastructure that is designed to be modular and extensible. The main benefits the tt-mlir project hopes to gain by using MLIR include:

  • Industry Standard Compiler Framework
    • Lots of boilerplate algorithms, data structures, and useful software that is common to compiler development
  • Ecosystem
    • Hook into existing front-end MLIR projects
  • Testing framework
    • A battle-tested test infrastructure that will enable us to write fine grained tests and rely less on end-to-end testing
    • Common IR Serialization Format that's easy to test, debug, and edit

Additional documentation to highlight the benefits of MLIR can be found here:

MLIR: Overview

MLIR is, at its root, a framework that can parse "readable" IR text in the .mlir format. Its unique properties lie in the modularity of the parsing itself. MLIR is built upon a collection of Dialects, and each of these Dialects defines a collection of Operations, Types, and Attributes. These dialects follow their own syntax and can encode any amount of information. The benefit is that MLIR provides bindings and hooks so that a user can directly translate these IRs into usable artifacts for that layer of complexity. An example would be the relatively high-level TOSA Dialect, which represents computation over tensors, being lowered to a more hardware-specific dialect that closely models the programming model of the hardware or underlying backend. It is the dialect system itself that powers the multi-level functionality of MLIR: with different dialects, a user can essentially "lower" through their software stack by transforming between the dialects for each layer. Dialects span a broad range, from purely mathematical dialects, to a Linalg Dialect, to a TensorFlow Dialect defined for ML graphs. Each dialect encodes its own information, and its operations can use the Types/Attributes of other dialects as parameters. Multiple dialects can coexist in one module, and mixing them is encouraged to take advantage of the optimizations each dialect offers. In our use case for the TT Stack, MLIR acts as a "mid-level" compiler which makes the task of joining together various entry points and backends much simpler.

MLIR Primitives

So what does MLIR look like, and how does it work and get parsed? The structure of an MLIR module is shown below:

#permutation = array<i64: 0, 2, 1>

module {
  func.func @forward(%input: tensor<32x64x128xf32>) -> tensor<32x128x64xf32> {
    %output = ttir.empty() : tensor<32x128x64xf32>
    %result = "ttir.permute"(%input, %output) <{permutation = #permutation}> : (tensor<32x64x128xf32>, tensor<32x128x64xf32>) -> tensor<32x128x64xf32>
    return %result : tensor<32x128x64xf32>
  }
}
  • Attributes (defined using #)

    • The syntax for actually creating an attribute is modular, and custom assembly formats for different attributes can be applied.
  • Operations

    • Operations are namespaced by their dialect with a ., so you'll see examples like func.func or ttir.empty. Each operation also provides its own assembly format, but often strictly defines the types of its results.

    • Quotes are added around "ttir.permute" because it is printed in MLIR's generic operation form, which is common for operations from custom dialects.

    • Operations typically have operands (arguments) and results, which are marked with %; these operands and results show the relationships between operations.

  • Types

    • Types are shown as data formats throughout this compiled mlir module; tensor and array are some examples.

    • They help to demonstrate how information and its representation are transformed as it is processed across this module.

MLIR Workflow

The overall MLIR workflow doesn't involve writing .mlir files by hand, and not necessarily even modifying them. The Intermediate Representations are truly just representations: we can print them to show what the graph looks like at a given stage of compilation, or run passes over them to optimize certain functions. The overall framework is designed with the following architecture in mind:

  1. Graph Information exists

  2. Graph Information is transformed (through whatever method) into a high-level MLIR representation

  3. Passes are run on the high-level implementation to lower into TTIR, a common IR that can be lowered into multiple backends

  4. Depending on the use case, more passes are run to lower to whatever backend the user would like (e.g. the TTNN backend)

What are Passes?

Transformations in MLIR are represented as passes that run over MLIR modules. These passes can be executed when parsing or generating MLIR modules, and how they modify the module is entirely user-defined. These transformations can serve a myriad of purposes. Some passes exist for lowering, as mentioned before, where a dialect is parsed and each operation is transformed into a lower-level dialect following a set of user-defined rules. In the context of this project, passes are also used for optimizations and backend code transformation. They're a powerful tool, providing most of the functionality for transforming between layers of dialects, and they offer a simple platform for modifying an MLIR module.
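
To make this concrete, here is a minimal Python sketch (a preview of the ttir_builder and ttmlir.passes helpers described later in this document, not the only way to run passes): it builds a small TTIR module and then runs the TTIR-to-TTNN lowering pass pipeline over it.

from ttir_builder import Operand, TTIRBuilder
from ttir_builder.utils import build_mlir_module, run_pipeline
from ttmlir.passes import ttir_to_ttnn_backend_pipeline

def model(in0: Operand, in1: Operand, builder: TTIRBuilder):
    # A tiny op graph; the pass pipeline below rewrites these TTIR ops
    # into TTNN ops without us hand-editing any .mlir text.
    return builder.multiply(in0, in1)

# Build the high-level (TTIR) module ...
module, builder = build_mlir_module(model, [(32, 32), (32, 32)])

# ... then run a pass pipeline over it to lower it into the TTNN dialect.
ttnn_module = run_pipeline(module, ttir_to_ttnn_backend_pipeline)

Swapping in ttir_to_ttmetal_backend_pipeline instead lowers toward the TTMetal backend, which is exactly the kind of dialect-to-dialect transformation passes are for.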

Why not make our own?

Now that I've described the functionality of the MLIR framework, it might seem that building an in-house multi-level Intermediate Representation system would be pretty similar, so why are we going through the effort of adopting this framework?

One of the biggest reasons is the active developer community surrounding the project. Being part of the LLVM Project means there is solid developer support, and the framework is designed to be a tool for many different paradigms of compute. This scalability and strong mission statement make MLIR a solid platform to use as the middle layer of our compiler stack. Furthermore, as a practical benefit of being part of a larger open source project, MLIR has a whole library of tests and infrastructure that we can leverage to maintain solid code health while starting a new project.

Automation

It's not only about developer support: another key benefit of MLIR is that it's built with autogeneration in mind. Through TableGen, much of the boilerplate of creating this multi-level IR is abstracted away, letting developers focus on implementation and execution. This automation is built on top of a pre-existing, robust framework with many implementations and support from other large players in the ML scene. By integrating with these automation pipelines, we give external developers a much simpler entry point into our software stack!

TT-MLIR: Bringing MLIR to the TT Stack

Now that we have defined this pretty cool project, let's look at the implementation details of bringing MLIR (and related optimizations) into the TT Stack. Since it acts as a mid-level compiler, we can start by defining the "top" and "bottom" layers of the compiler. BUDA already has a well-defined set of frontend optimizations over a TVM-defined graph and is aware of the hardware these models will run on. We want to interrupt the BUDA stack so that it only gives us the frontend-compiled graph, before any hardware-specific lowering occurs. This produces information that is agnostic to the different backends and their execution on TT hardware, but is still valid information to optimize at different levels for later compilation. The "bottom" of our stack is then defined as the backend that produces the machine-specific code to be executed. While MLIR could allow any level of complexity downwards for the bottom, we will define a very aggressive TTNN backend for the MVP. Desired optimization list:

  • Forge-FE (frontend)

    • Graph Optimizations, Constant Folding, Operation Fusion
  • TT-MLIR (mid-level)

    • Data Storage, Memory Configuration, Grid Configuration
  • TT-NN (backend)

    • Kernel Configuration*, Network Optimization

*Subject to Change / Be Moved to TT-MLIR

TT-MLIR Dialects

Now that we have defined the series of optimizations that we would like to see implemented in TT-MLIR, we can begin to define the dialects that support these different levels of optimization. For more detail on each of these dialects, please refer to the GitHub Wiki and the TableGen descriptors, which document the key functionality well.

TT Dialect

The TT Dialect holds only the common Types and Attributes used throughout the many levels of the mid-level compiler.

TTIR Dialect

The TTIR Dialect is the common dialect for TT-MLIR; as such, it doesn't define anything hardware- or backend-specific. It lists general actions that take place on TT hardware, such as dispatch, layout, and kernel operations.

Generic Operation

This is one of two operations that are crucial to understanding the intended optimization characteristics of the TTIR Dialect. The generic operation describes the actions taken to dispatch some computation to TT hardware so that it executes. Parametrically, the operation consumes inputs, outputs, maps for reading the tensors, and access types for the memory. These parameters highlight the optimizations that can be performed at this level: changing the location of the memory, transposing via different access maps, or even changing the grid upon which the computation takes place. The operation also contains a block in which the exact behaviour of that operation is stored.

Layout Operation

The layout operation is key to describing how memory is laid out throughout the execution graph. Layout determines the sharding spec, memory location, data type, and tile size of a tensor. While generic describes the dispatch of some data-wise transformation, the data itself is laid out across the chip through the layout operation.

Both of these operations describe the key functionality of the TTIR dialect and the optimization space that it provides.

Built-in MLIR Dialects

The functionality of the TT-MLIR dialects also depends on / is inspired by the functionality of built-in MLIR dialects like Affine and Linalg. Below are summaries of some of the key members of these dialects.

Affine Dialect

[Reference] Affine maps help to describe transformations on coordinate systems. If that doesn't quite land, imagine trying to index a rank-2 tensor. By getting t[x, y] I can access the element in the Xth row and Yth column, but if I wanted to transpose the tensor I might have to re-lay out the entire tensor so that the data would be accessible using t[x, y] to get the element in the Yth row and Xth column. This transpose can instead be represented using an affine map that transforms (x, y) -> (y, x), which lets the tensor data remain in place while only the access method is modified. This extends even further to more complex transformations, so that stride lengths or unusual indexing methods can be implemented without complicated data manipulation.
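
To make the idea concrete, here is a small, self-contained Python sketch (illustrative only, not tied to any MLIR API) of the same principle: the stored data never moves, only the index map used to access it changes.

# Row-major storage of a 2x3 "tensor": t[x, y] lives at data[x * cols + y]
data = [1, 2, 3,
        4, 5, 6]
rows, cols = 2, 3

def identity_map(x, y):
    # (x, y) -> (x, y): element in the Xth row and Yth column
    return x * cols + y

def transpose_map(x, y):
    # (x, y) -> (y, x): element of the transposed view, same storage
    return y * cols + x

print(data[identity_map(1, 2)])   # 6: row 1, column 2 of the original
print(data[transpose_map(2, 1)])  # 6: row 2, column 1 of the transposed view

An affine map in MLIR plays the same role for tensor layouts, and also covers strided or otherwise non-trivial indexing without rewriting the underlying data.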

Tensor Dialect

[Reference] The tensor dialect defines the functionality and Type of the fundamental Tensor. This dialect contains members that represent the manipulation and representation of tensors as multi-dimensional data with shapes and datatypes. Not much else is notable about this dialect; the reference covers the key topics if implementation details are needed.

Func Dialect

[Reference]

TOSA Dialect

[Reference]

SCF Dialect

[Reference]

EmitC Dialect

[Reference]

TT-Explorer - Performance Optimization Tool

A unique project related to TT-MLIR is the integration of performance optimization tools, so that users can easily visualize and readily tune their models without needing an expert-level understanding of the tech stack. TT-Explorer is built with Google AI's Model Explorer as the base for the visualization tool, plus a custom adapter to parse TT-MLIR projects. This allows users to readily tune their models and optimize at the TTIR layer (e.g. they can change certain memory to be laid out in L1 instead of DRAM, or make the grid layout of an operation larger than what was previously assigned). After compilation with these overrides, the runtime information can then be fed directly into a Tracy performance analysis for the user to visualize the impact of their tuning, see which operations were least performant, and continue in a gamified design loop of iterative performance tuning!

Getting Started

This page walks you through the steps required to set up tt-mlir.

NOTE: If you have a build issue, you can file a bug here.

Prerequisites

Hardware Setup

Use this guide to set up your hardware - Hardware Setup.

System Dependencies

You can use tt-mlir with Ubuntu or Mac OS; however, the runtime does not work on Mac OS. The tt-mlir project has the following system dependencies:

  • Ubuntu 22.04 OS or Mac OS
  • Clang >= 14 & <= 18
  • Ninja
  • CMake 3.24 or higher
  • Python 3.10
  • python3.10-venv
  • openmpi

Ubuntu

Install Clang, Ninja, CMake, and python3.10-venv:

sudo apt install git clang cmake ninja-build pip python3.10-venv
wget -q https://github.com/dmakoviichuk-tt/mpi-ulfm/releases/download/v5.0.7-ulfm/openmpi-ulfm_5.0.7-1_amd64.deb -O /tmp/openmpi-ulfm.deb && sudo apt install /tmp/openmpi-ulfm.deb

You should now have the required dependencies installed.

NOTE: If you intend to build with runtime enabled (-DTTMLIR_ENABLE_RUNTIME=ON), you also need to install tt-metal dependencies which can be found here.

Mac OS

On MacOS we need to install the latest versions of cmake and ninja, which can be done using Homebrew (docs for installing Homebrew: https://brew.sh):

brew install cmake ninja

Clone the tt-mlir Repo

  1. Clone the tt-mlir repo:
git clone https://github.com/tenstorrent/tt-mlir.git
  2. Navigate into the tt-mlir folder.

Environment Setup

There are two ways to set up the environment, either using a docker image or building the environment manually. The docker image is recommended since it is easier to set up and use.

Using a Docker Image

Please see Docker Notes for details on how to set up and use the docker image.

Once you have the docker image running and you are logged into the container, you should be ready to build.

Setting up the Environment Manually

This section explains how to manually build the environment so you can use tt-mlir. You only need to build this once; it builds llvm, flatbuffers, and a Python virtual environment. You can specify the LLVM build type by using -DLLVM_BUILD_TYPE=* (for example, -DLLVM_BUILD_TYPE=Release). The default is MinSizeRel, and available options are listed here.

  1. Navigate into the tt-mlir folder.

  2. The environment gets installed into a toolchain directory, which is by default set to /opt/ttmlir-toolchain, but can be overridden by setting (and persisting in your environment) the environment variable TTMLIR_TOOLCHAIN_DIR. You need to manually create the toolchain directory as follows:

export TTMLIR_TOOLCHAIN_DIR=/opt/ttmlir-toolchain/
sudo mkdir -p /opt/ttmlir-toolchain
sudo chown -R $USER /opt/ttmlir-toolchain
  3. Please ensure that you do not already have an environment (venv) activated before running the following commands:
cmake -B env/build env
cmake --build env/build
source env/activate

NOTE: The last command takes time to run, so give it time to complete.

Building the tt-mlir Project

In this step, you build the tt-mlir project:

source env/activate
cmake -G Ninja -B build
cmake --build build

You have now built tt-mlir.

You can add different flags to your build. Here are some options to consider:

  • To enable the ttnn/metal runtime add -DTTMLIR_ENABLE_RUNTIME=ON. Clang 17 is the minimum required version when enabling the runtime.
  • To enable the ttnn/metal perf runtime add -DTT_RUNTIME_ENABLE_PERF_TRACE=ON.
  • To accelerate the builds with ccache use -DCMAKE_CXX_COMPILER_LAUNCHER=ccache.
  • To work around OOM issues, it can be useful to decrease the number of parallel jobs with -DCMAKE_BUILD_PARALLEL_LEVEL=4.
  • If Python bindings aren't required for your project, you can accelerate builds further with the command -DTTMLIR_ENABLE_BINDINGS_PYTHON=OFF.
  • The TTNN build is automatically integrated / handled by the tt-mlir cmake build system. For debugging and further information regarding the TTNN backend build step, please refer to TTNN Documentation.
  • The runtime build depends on the TT_METAL_HOME variable, which is also set in the env/activate script. For more information, please refer to the TT-NN and TT-Metalium installation documentation.
Supported build configurations by OS (Ubuntu 22.04, Ubuntu 20.04, MacOS): offline compiler only, runtime enabled, runtime + perf enabled. Note that the runtime does not work on MacOS.

Test the Build

Use this step to check your build. Do the following:

source env/activate
cmake --build build -- check-ttmlir

Lint

Set up lint so you can spot errors and stylistic issues before runtime:

source env/activate
cmake --build build -- clang-tidy

Note for developers: You can run:

source env/activate
cmake --build build -- clang-tidy-ci

This reproduces the Lint (clang-tidy) CI job. It runs clang-tidy only on committed files that have been modified relative to the origin/main branch.

Pre-Commit

Pre-Commit applies a git hook to the local repository such that linting is checked and applied on every git commit action. Install from the root of the repository using:

source env/activate
pre-commit install

If you have already committed before installing the pre-commit hooks, you can run on all files to "catch up":

pre-commit run --all-files

For more information visit pre-commit

Docs

Build the documentation by doing the following:

  1. Make sure you have mdbook and doxygen installed.

  2. Build the docs:

source env/activate
cmake --build build -- docs
mdbook serve build/docs

NOTE: mdbook serve will by default create a local server at http://localhost:3000.

Common Build Errors

TTMLIRPythonCAPI target requires changing an RPATH

CMake Error at /opt/ttmlir-toolchain/lib/cmake/llvm/AddLLVM.cmake:594 (add_library):
  The install of the TTMLIRPythonCAPI target requires changing an RPATH from
  the build tree, but this is not supported with the Ninja generator unless
  on an ELF-based or XCOFF-based platform.  The
  CMAKE_BUILD_WITH_INSTALL_RPATH variable may be set to avoid this relinking
  step.

If you get the above error, it means you tried to build with an old version of cmake or ninja and there is a stale file. To fix this, rm -rf your build directory, install a newer version of cmake/ninja, and then rebuild. If you installed ninja via sudo apt install ninja-build, it might still be out of date (v1.10.0). You may use the ninja in the python virtual environment, or install it via pip3 install -U ninja; either way, version 1.11.1.git.kitware.jobserver-1 should work.

clang++ is not a full path and was not found in the PATH

CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_CXX_COMPILER:
    clang++
  is not a full path and was not found in the PATH.
  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.
CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_C_COMPILER:
    clang
  is not a full path and was not found in the PATH.
  Tell CMake where to find the compiler by setting either the environment
  variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
  the compiler, or to the compiler name if it is in the PATH.

If you get the above error, it means you need to install clang, which you can do with sudo apt install clang on Ubuntu.

tt-metal Update Failures

Failed to unstash changes in: '/path/to/tt-metal/src/tt-metal'
You will have to resolve the conflicts manually

This error occurs during CMake's ExternalProject update of tt-metal. The build system tries to apply changes using Git's stash mechanism, but fails due to conflicts. This can happen even if you haven't manually modified any files, as the build process itself may leave behind artifacts or partial changes from previous builds.

To resolve, run the following command:

rm -rf third_party/tt-metal

Then retry your build command. If the error persists, you may need to do the following:

  1. Remove the build directory: rm -rf build

  2. Run CMake commands again.

  3. Run the above.

Common Runtime Errors

Debugging Python on Mac OS

When debugging python on macOS via lldb you may see an error like:

(lldb) r
error: process exited with status -1 (attach failed (Not allowed to attach to process.  Look in the console messages (Console.app), near the debugserver entries, when the attach failed.  The subsystem that denied the attach permission will likely have logged an informative message about why it was denied.))

For preinstalled macOS binaries you must manually codesign with debug entitlements.

Create file debuggee-entitlement.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>com.apple.security.cs.disable-library-validation</key>
        <true/>
        <key>com.apple.security.get-task-allow</key>
        <true/>
</dict>
</plist>

Sign the binary:

sudo codesign -f -s - --entitlements debuggee-entitlement.xml /opt/ttmlir-toolchain/venv/bin/python

Working with Docker Images

Components:

  • Dockerfile
  • Workflow for building Docker image
  • Project build using Docker image

Overview

We use docker images to prepare the project environment, install dependencies and tooling, and prebuild the toolchain. The project builds four docker images:

  • tt-mlir-base-ubuntu-22-04: base image with the project dependencies
  • tt-mlir-ci-ubuntu-22-04: CI image with the prebuilt toolchain
  • tt-mlir-ird-base-ubuntu-22-04: IRD image built on top of the base image
  • tt-mlir-ird-ubuntu-22-04: IRD image built on top of the CI image

The base image starts with a supported base image (Ubuntu 22.04) and installs the dependencies needed to build the project. From there, we build the CI image, which contains the prebuilt toolchain and is used in CI to shorten the build time. The IRD images add dev tools such as GDB, vim and ssh, which are used in IRD environments.

During the CI Docker build, the project is built and tests are run to ensure that everything is set up correctly. If any dependencies are missing, the Docker build will fail.

Using the Docker Image

Here is a typical command to run the latest developer (ird) docker image:

sudo docker run -it -d --rm \
  --name my-docker \
  --cap-add ALL \
  --device /dev/tenstorrent/0:/dev/tenstorrent/0 \
  -v /dev/hugepages:/dev/hugepages \
  -v /dev/hugepages-1G:/dev/hugepages-1G \
  ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-ubuntu-22-04:latest bash

Special attention should be paid to flags:

  • --device /dev/tenstorrent/0:/dev/tenstorrent/0: this is required to map the hardware device into the container. For machines with multiple devices, this flag can be specified multiple times or adjusted with the appropriate device number.
  • -v /dev/hugepages:/dev/hugepages / -v /dev/hugepages-1G:/dev/hugepages-1G: this is required to map the hugepages volume into the container. For more information on hugepages, please refer to the Getting Started Guide.

The base or CI image can also be used in the same way, but the IRD image is recommended for development.

Using the Docker Image via IRD (Internal Developers Only)

Internally we use a tool called IRD. As part of your reserve command, you can specify the docker image to use:

ird reserve \
  --docker-image ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-ubuntu-22-04:latest

See ird reserve --help for more information on the reserve command. Typical ird usage might look like:

# list machine availability
ird list-machines

# reserve a machine
ird reserve \
  --volumes /localdev/$USER:/localdev/$USER \
  --docker-image ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-ubuntu-22-04:latest \
  --timeout 720 \
  wormhole_b0 \
  --machine [MACHINE_NAME]

# list your currently reserved machines
ird list

# connect to the first reserved machine
ird connect-to 1

# release the first reserved machine
ird release 1

Building the Docker Image using GitHub Actions

The GitHub Actions workflow Build and Publish Docker Image builds the Docker images and uploads them to GitHub Packages at https://github.com/orgs/tenstorrent/packages?repo_name=tt-mlir. We use the git SHA we build from as the tag.

Building the Docker Image Locally

To test the changes and build the image locally, use the following command:

docker build -f .github/Dockerfile.base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ci-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-ubuntu-22-04:latest .

Using the Image in GitHub Actions Jobs

The GitHub Actions workflow Build in Docker uses a Docker container for building:

    container:
      image: ghcr.io/${{ github.repository }}/tt-mlir-ci-ubuntu-22-04:latest
      options: --user root

Testing

To run tests:

source env/activate
cmake --build build -- check-ttmlir

Lit testing

llvm-lit tool is used for MLIR testing. With it you can:

# Query which tests are available
llvm-lit -sv ./build/test --show-tests

# Run an individual test:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR/test_allocate.mlir

# Run a sub-suite:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR

See the full llvm-lit documentation for more information.

EmitC testing

NOTE: This is a developer's guide on how to test EmitC as a feature. For usage of EmitC, please refer to ttnn-standalone docs.

Prerequisites

  • Built ttmlir

  • Built ttrt

  • Saved system descriptor file:

    ttrt query --save-artifacts
    
  • Activated virtual environment:

    source env/activate
    

Generate EmitC tests and run it

  1. Generate flatbuffers and .cpp files for EmitC tests

    If you don't have the SYSTEM_DESC_PATH environment variable exported, you can run:

    SYSTEM_DESC_PATH=/path/to/system_desc.ttsys llvm-lit -sv test/ttmlir/EmitC/TTNN
    

    Or if you have SYSTEM_DESC_PATH exported, you can omit it:

    llvm-lit -sv test/ttmlir/EmitC/TTNN
    
  2. Compile generated .cpp files to shared objects

    tools/ttnn-standalone/ci_compile_dylib.py
    
  3. Run flatbuffers + shared objects and compare results

    ttrt run --emitc build/test/ttmlir/EmitC/TTNN
    

Tools

Currently, there are a few primary tools that are part of the ttmlir project:

  • ttmlir-opt: The ttmlir optimizer driver. This tool is used to run the ttmlir compiler passes on .mlir source files and is central to developing and testing the compiler.
  • ttmlir-translate: The ttmlir translation tool. This tool can convert from IR to external representation (and inverse). For example, IR in EmitC dialect can be converted into C++ code.
  • ttrt: This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files.
  • ttir-builder: This tool is for creating TTIR operations. It provides support for those ops to be compiled into modules or directly into flatbuffer files.
  • tt-explorer: Visualizer tool for ttmlir-powered compiler results. It visualizes emitted .mlir files to display the compiled model, attributes, and performance results, and provides a platform for human-driven overrides to gamify model tuning.
  • ttnn-standalone: This tool is used to run C++ TTNN code outside of the compiler environment.

ttmlir-opt

The ttmlir optimizer driver. This tool is used to run the ttmlir compiler passes on .mlir source files and is central to developing and testing the compiler.

Simple Test

./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir
# Or
./build/bin/ttmlir-opt --ttir-to-ttmetal-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir

ttmlir-translate

The ttmlir-translate translation utility. Unlike the ttmlir-opt tool, which is used to run passes within the MLIR world, ttmlir-translate allows us to ingest something (e.g. code) into the MLIR world, and also to produce something (e.g. an executable binary, or even code again) from MLIR.

Generate C++ code from MLIR

# First, let's run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-to-emitc-pipeline test/ttmlir/Dialect/TTNN/eltwise/binary/multiply/simple_multiply.mlir -o c.mlir

# Now run `ttmlir-translate` to produce C++ code
./build/bin/ttmlir-translate --mlir-to-cpp c.mlir

Bonus: These two commands can be piped, to avoid writing a mlir file to disk, like so:

./build/bin/ttmlir-opt --ttir-to-emitc-pipeline test/ttmlir/Dialect/TTNN/eltwise/binary/multiply/simple_multiply.mlir | ./build/bin/ttmlir-translate -mlir-to-cpp

Generate flatbuffer file from MLIR

# First run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/eltwise/binary/multiply/simple_multiply.mlir -o ttnn.mlir

# Now run `ttmlir-translate` to produce flatbuffer file
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn

ttrt

This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files. It enables the running of flatbuffer files without a front-end runtime.

Building

source env/activate
cmake --build build -- ttrt
ttrt --help

Building runtime mode

Add the following flags when building the compiler

-DTTMLIR_ENABLE_RUNTIME=ON

Building perf mode

Add the following flags when building the compiler

-DTTMLIR_ENABLE_RUNTIME=ON
-DTT_RUNTIME_ENABLE_PERF_TRACE=ON

LOGGER Levels

ttrt supports logging at different logger levels. You will need to set the env var TTRT_LOGGER_LEVEL. By default, it's set to INFO.

TTRT_LOGGER_LEVEL=INFO
TTRT_LOGGER_LEVEL=CRITICAL
TTRT_LOGGER_LEVEL=ERROR
TTRT_LOGGER_LEVEL=WARNING
TTRT_LOGGER_LEVEL=DEBUG

Installing ttrt as a python wheel

Every time you build ttrt, it creates a wheel (.whl) file in build/runtime/tools/ttrt/build, e.g. ttrt-0.0.235-cp310-cp310-linux_x86_64.whl. You can take this wheel and install it in any docker container and in any venv outside of ttmlir, after which you can use all the following functionality the same way.

  1. Download the wheel
  2. Create a python venv
python -m venv ttrt_env
source ttrt_env/bin/activate
  3. Install the wheel (replace with your version of the wheel)
pip install ttrt-0.0.235-cp310-cp310-linux_x86_64.whl

Generate a flatbuffer file from ttir-builder

ttir-builder is a tool for creating TTIR ops, converting them into MLIR modules, running passes to lower them into backends, and translating them to flatbuffers.

Generate a flatbuffer file from compiler

The compiler supports a pass to load a system descriptor to compile against. You can feed this pass into ttmlir-opt.

  1. Build ttmlir
  2. Build ttrt (see building section on this page)
  3. Generate a ttsys file from the system you want to compile for using ttrt. This will create a system_desc.ttsys file under the ttrt-artifacts folder.
ttrt query --save-artifacts
  4. Use the ttmlir-opt tool in the compiler to feed in the system descriptor. See the ttmlir-opt documentation for more information on how to generate .mlir files.
./build/bin/ttmlir-opt --tt-register-device="system-desc-path=/path/to/system_desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
or (pipe the path directly into ttir-to-ttnn-backend-pipeline)
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=/path/to/system_desc.ttsys" test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
  5. Use the ttmlir-translate tool in the compiler to generate the flatbuffer executable. See the ttmlir-translate documentation for more information on how to generate flatbuffer files.
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
  6. Run your test cases using ttrt
ttrt run /path/to/out.ttnn

Generate flatbuffer files using llvm-lit

There are already existing .mlir test cases under test/ttmlir/Silicon. You can use llvm-lit tool to generate the corresponding ttnn and ttm files.

  1. Build ttmlir
  2. Build ttrt (see building section on this page)
  3. Generate a ttsys file from the system you want to compile for using ttrt. This will create a system_desc.ttsys file under the ttrt-artifacts folder.
ttrt query --save-artifacts
  4. Export this file in your environment using export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys. When llvm-lit is run, it will query this variable and generate the ttnn and ttm files using this system. Optionally, you can also provide this manually when running llvm-lit.
  5. Generate your test cases. This will generate all your ttnn and ttm files under build/test/ttmlir/Silicon. ttnn files have a .ttnn file extension and ttmetal files have a .ttm extension.
cmake --build build -- check-ttmlir
  6. (Optional) If you have a single .mlir file (or a directory of custom .mlir files) that you created using the compiler, and you want to generate the corresponding ttnn and ttm files for it, you can run llvm-lit standalone on the path of your .mlir file or directory of .mlir files to generate the flatbuffer executables. You will have to make sure you add the correct llvm-lit configs to your .mlir file; see the section on adding llvm-lit config options inside a .mlir file to create flatbuffer binaries for more info. You must also make sure your .mlir test is found within the test/ttmlir/Silicon folder (and point lit to the build folder)!
llvm-lit -v ./build/test/ttmlir/Silicon
or
SYSTEM_DESC_PATH=/path/to/system_desc.ttsys llvm-lit -v ./build/test/ttmlir/Silicon
  7. Run your test cases using ttrt
ttrt run /path/to/test.ttnn
ttrt run /path/to/dir/of/flatbuffers

Adding llvm-lit config options inside a .mlir file to create flatbuffer binaries

Inside your .mlir file, you can add certain config options that llvm-lit will use when running against that test case. For the purpose of generating flatbuffer executables, you can add --tt-register-device="system-desc-path=%system_desc_path%", which tells llvm-lit to use the system desc pointed to by the environment variable set via export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys. You can also provide a custom path to a system desc file instead.

// RUN: ttmlir-opt --tt-register-device="system-desc-path=%system_desc_path%" --ttnn-layout --convert-ttir-to-ttnn %s  > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn

Adding new mlir test cases

You can copy your .mlir test file (with the appropriate llvm-lit config options for generating flatbuffer binaries) into test/ttmlir/Silicon. Then, follow generating flatbuffer files using llvm-lit to generate the executables to run!

Versioning

ttrt and flatbuffers have a strict versioning check. When running a flatbuffer against ttrt, you have to make sure the flatbuffer was generated using the same version as ttrt (or vice versa). Major and minor versions are manually set using github tags when releases are made. Patch versioning is the number of commits since the last major/minor tag; for example, a version of 0.0.235 indicates 235 commits since the 0.0 tag.

vmajor.minor.patch

APIs

ttrt --help
ttrt read
ttrt run
ttrt query
ttrt perf
ttrt check

Command Line

There are different ways you can use the APIs under ttrt. The first is via the command line, as follows. All artifacts are saved under the ttrt-artifacts folder under the directory specified by the TT_MLIR_HOME environment variable. By default, all logging is printed to the terminal. You can specify a log file to dump output to.

read

Read sections of a binary file

ttrt read --help
ttrt read --section mlir out.ttnn
ttrt read --section cpp out.ttnn
ttrt read --section version out.ttnn
ttrt read --section system_desc out.ttnn
ttrt read --section inputs out.ttnn
ttrt read --section outputs out.ttnn
ttrt read --section all out.ttnn
ttrt read --section all out.ttnn --clean-artifacts
ttrt read --section all out.ttnn --save-artifacts
ttrt read --section all /dir/of/flatbuffers
ttrt read system_desc.ttsys
ttrt read --section system_desc system_desc.ttsys
ttrt read system_desc.ttsys --log-file ttrt.log
ttrt read out.ttnn --save-artifacts --artifact-dir /path/to/some/dir
ttrt read out.ttnn --result-file result.json

run

Run a binary file or a directory of binary files. Note: You must be on a system with silicon and have a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).

ttrt run --help
ttrt run out.ttnn
ttrt run out.ttnn --seed 0
ttrt run out.ttnn --init arange
ttrt run out.ttnn --identity
ttrt run out.ttnn --identity --rtol 1 --atol 1
ttrt run out.ttnn --clean-artifacts
ttrt run out.ttnn --save-artifacts
ttrt run out.ttnn --loops 10
ttrt run --program-index all out.ttnn
ttrt run --program-index 0 out.ttnn
ttrt run /dir/of/flatbuffers
ttrt run /dir/of/flatbuffers --loops 10
ttrt run /dir/of/flatbuffers --log-file ttrt.log
ttrt run out.ttnn --save-artifacts --artifact-dir /path/to/some/dir
ttrt run out.ttnn --load-kernels-from-disk
ttrt run out.ttnn --result-file result.json
ttrt run out.ttnn --disable-golden
ttrt run out.ttnn --save-golden-tensors
ttrt run out.ttnn --debugger
ttrt run out.ttnn --memory --save-artifacts
ttrt run out.ttnn --memory --check-memory-leak

query

Query the system to obtain the system desc file (optionally store it to disk). Note: You must be on a system with silicon and have a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).

ttrt query --help
ttrt query
ttrt query --quiet
ttrt query --save-artifacts
ttrt query --clean-artifacts
ttrt query --save-artifacts --log-file ttrt.log
ttrt query --save-artifacts --artifact-dir /path/to/some/dir
ttrt query --result-file result.json

perf

Run performance mode on a binary file or a directory of binary files. Note: You must be on a system with silicon and have a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON) as well as a perf-enabled build (-DTT_RUNTIME_ENABLE_PERF_TRACE=ON). Note: You can collect host-only performance data via the --host-only flag; by default, both host and device side performance data are collected. If the save-artifacts flag is provided, perf mode will dump the following files in the artifacts directory:

ops_perf_results.csv : compiled op performance results
profile_log_device.csv : dump of all device side profiled results
tracy_ops_data.csv : op data results dumped in a readable format
tracy_ops_times.csv : op time results dumped in a readable format
tracy_profile_log_host.tracy : tracy profiled results file; this file can be fed into the tracy GUI

ttrt perf --help
ttrt perf out.ttnn
ttrt perf out.ttnn --clean-artifacts
ttrt perf out.ttnn --save-artifacts
ttrt perf out.ttnn --loops 10
ttrt perf --program-index all out.ttnn
ttrt perf --program-index 0 out.ttnn
ttrt perf --host-only out.ttnn
ttrt perf /dir/of/flatbuffers --host-only
ttrt perf /dir/of/flatbuffers --loops 10 --host-only
ttrt perf /dir/of/flatbuffers --log-file ttrt.log --host-only
ttrt perf --save-artifacts --artifact-dir /path/to/some/dir
ttrt perf out.ttnn --result-file result.json
ttrt run out.ttnn --memory

To use the Tracy GUI, run the following commands on your MacBook. You can then upload your .tracy file into the GUI to view the profiled dumps.

git clone https://github.com/tenstorrent-metal/tracy.git
cd tracy/profiler/build/unix
make all
./Tracy-release

check

Check a binary file or a directory of binary files against a system desc (by default, uses the host machine). Note: You must be on a system with silicon and have a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).

ttrt check --help
ttrt check out.ttnn
ttrt check out.ttnn --system-desc /path/to/system_desc.ttsys
ttrt check out.ttnn --clean-artifacts
ttrt check out.ttnn --save-artifacts
ttrt check out.ttnn --log-file ttrt.log
ttrt check /dir/of/flatbuffers --system-desc /dir/of/system_desc
ttrt check --save-artifacts --artifact-dir /path/to/some/dir out.ttnn
ttrt check out.ttnn --result-file result.json

gdb

You can relaunch ttrt inside of gdb which can be useful for debugging C++ runtime components.

ttrt --gdb run ...
ttrt --gdb perf ...

ttrt as a python package

The other way to use the APIs under ttrt is importing it as a library. This allows the user to use it in custom scripts.

Import ttrt as a python package

from ttrt.common.api import API

Setup API and register all features

API.initialize_apis()

Setup arguments

You can specify certain arguments to pass to each API, or use the default arguments provided

args

This can be a dictionary of values to set inside your API instance. These are the same options as found via the command line. You can get the full list of supported arguments via the ttrt --help command. Any argument not provided will be set to its default.

custom_args = {}
custom_args["--clean-artifacts"] = True
query_instance = API.Query(args=custom_args)

logging

You can specify a logging module to use inside your API instance. The rationale behind this is to support different instances of different APIs, each logging to a different file. You can also customize the level of detail your log file contains.

from ttrt.common.util import Logger
import os

os.environ["LOGGER_LEVEL"] = "DEBUG"
log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)
read_instance = API.Read(logger=custom_logger)

To set logging level through the terminal, use environment variable TTRT_LOGGER_LEVEL.

   export TTRT_LOGGER_LEVEL=DEBUG

artifacts

You can specify an artifacts directory to store all the generated metadata from the execution of any API run. This allows you to use different artifact directories for different API instances if you wish.

from ttrt.common.util import Artifacts

log_file_name = "some_file_name.log"
artifacts_folder_path = "/opt/folder"
custom_logger = Logger(log_file_name)
custom_artifacts = Artifacts(logger=custom_logger, artifacts_folder_path=artifacts_folder_path)
run_instance = API.Run(artifacts=custom_artifacts)

Execute API

Once all the arguments are set up, you can run your API instance with the arguments you provided. Note: APIs are stateless, so subsequent calls to the same API instance will not preserve artifacts from previous calls. If you wish to call the APIs multiple times, you can, for example, generate a new artifacts directory for each run.

result_code, results = query_instance()
result_code, results = read_instance()
result_code, results = run_instance()

Putting it all together

You can do interesting things by combining all the above features in your python script:

from ttrt.common.api import API
from ttrt.common.util import Logger
from ttrt.common.util import Artifacts

API.initialize_apis()

custom_args = {}
custom_args["--clean-artifacts"] = True
custom_args["--save-artifacts"] = True
custom_args["--loops"] = 10
custom_args["--init"] = "randn"
custom_args["binary"] = "/path/to/subtract.ttnn"

log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)

artifacts_folder_path = "/opt/folder"
custom_artifacts = Artifacts(logger=custom_logger, artifacts_folder_path=artifacts_folder_path)

run_instance = API.Run(args=custom_args, logger=custom_logger, artifacts=custom_artifacts)
result_code, results = run_instance()

Bonus Section: Extending runtime to other FE's

MLIR Runtime exposes a feature to register python callback functions. Any two python functions can be provided: the first function will be executed before every op in MLIR Runtime, the second after every op. The following steps describe how to extend your application to register python functions.

  1. Pybind the DebugHooks C++ class, specifically tt::runtime::debug::Hooks::get. See runtime/tools/ttrt/ttrt/runtime/module.cpp for an example of how TTRT pybinds it.
tt::runtime::debug::Hooks
tt::runtime::debug::Hooks::get
  2. Register callback functions in your python script. The following registers two golden python functions, assuming the Debug Hooks get function has been pybinded to ttrt.runtime.DebugHooks.get:
callback_env = ttrt.runtime.DebugHooks.get(preOpGolden, postOpGolden)
  3. The callback functions have a particular signature, which looks like the following:
def preOpGolden(binary, program_context, op_context):

   • binary: reference to the binary you are currently running, a ttrt.binary Binary object
   • program_context: reference to the program currently running, a ttrt.runtime ProgramContext object
   • op_context: reference to the op that is currently running, a ttrt.runtime OpContext object

  4. Each of these parameters exposes certain APIs which can be called within the callback functions:
loc = ttrt.runtime.get_op_loc_info(op_context) : get the location of the op as a string, which is used as the key when indexing the golden tensors stored in the flatbuffer
op_debug_str = ttrt.runtime.get_op_debug_str(op_context) : get the op debug string (contains op metadata including op type, attributes, input tensor shapes and dtypes, memref with layout and buffer type, and loc)
op_golden_tensor = ttrt.runtime.get_debug_info_golden(binary, loc) : get the golden tensor from the binary as a ttrt.binary GoldenTensor object
op_output_tensor = ttrt.runtime.get_op_output_tensor(op_context, program_context) : get the currently running output tensor from device as a ttrt.runtime Tensor object; if this is called in a preOp function or the op doesn't output a tensor, an empty tensor will be returned
  5. A potential application for these callback functions is implementing a golden callback. TTRT achieves this by first storing the golden data within the flatbuffer binary. This embedding is done through ttir-builder. See runtime/tools/ttrt/ttrt/common/callback.py for how ttrt implements the golden callback function.
std::unordered_map<std::string, mlir::tt::GoldenTensor> goldenMap
mlir::tt::ttnn::translateTTNNToFlatbuffer(moduleOp, file, goldenMap)

Note: ttrt is not needed to implement this callback feature; it merely provides an example of how the feature can be implemented for a golden application.
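
Putting the pieces above together, the following is a minimal illustrative sketch of registering pre-/post-op callbacks. It only uses the APIs named in this section; for the exact import paths and a full golden comparison, see runtime/tools/ttrt/ttrt/common/callback.py.

import ttrt.runtime

def pre_op(binary, program_context, op_context):
    # Called before each op executes: log which op is about to run.
    loc = ttrt.runtime.get_op_loc_info(op_context)
    print(f"about to run op at loc: {loc}")

def post_op(binary, program_context, op_context):
    # Called after each op executes: fetch the op's output and, if the
    # flatbuffer carries golden data for this loc, fetch that too.
    loc = ttrt.runtime.get_op_loc_info(op_context)
    print(ttrt.runtime.get_op_debug_str(op_context))
    output_tensor = ttrt.runtime.get_op_output_tensor(op_context, program_context)
    golden_tensor = ttrt.runtime.get_debug_info_golden(binary, loc)
    # ... compare output_tensor against golden_tensor here (see callback.py) ...

# Register both hooks; MLIR Runtime will invoke them around every op.
callback_env = ttrt.runtime.DebugHooks.get(pre_op, post_op)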

FAQ

Flatbuffer version does not match ttrt version!

  • ttrt and flatbuffer have strict versioning that is checked during ttrt execution. You will have to generate a flatbuffer using the same version of ttrt (or vice versa). This means you might have to build on the same branch on which the flatbuffer was generated, or regenerate the flatbuffer using your current build.

System desc does not match flatbuffer!

  • flatbuffers are compiled using a specific system desc (or default values if no system desc is provided). During runtime, the flatbuffer system desc is checked against the current system to ensure the system being run on supports the flatbuffer that was compiled. If you get this error, you will have to regenerate the flatbuffer using the system you want to run on. See generate a flatbuffer file from compiler section on how to do this.

I just want to test and push my commit! What do I do!

  • follow these steps (on both n150 and n300)
1. Build ttmlir (sample instructions - subject to change)
source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTTMLIR_ENABLE_RUNTIME=ON -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
cmake --build build

2. Build ttrt (sample instructions - subject to change)
cmake --build build -- ttrt

3. Query system
ttrt query --save-artifacts

4. Export system desc file
export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys (path dumped in previous command)

5. Generate test cases
cmake --build build -- check-ttmlir

6. Run test cases
ttrt run build/test/ttmlir/Silicon

7. (Optional) Run perf test cases
ttrt perf build/test/ttmlir/Silicon

TTRT yields an ambiguous segmentation fault when I try to read/run a .ttnn file!

The ttrt toolchain has specific behaviors and requirements that can lead to build and runtime issues, particularly when dealing with version mismatches or out-of-sync dependencies.

Version Mismatch Due to Local Commits

The ttrt toolchain verifies whether the current system configuration matches the model’s compilation environment. This verification involves tracking the number of commits since the last synchronization. When local commits are made in your branch, it may trigger a version mismatch between the compiled model and the current environment. This mismatch may not be handled properly by the runtime (rt), leading to potential issues.

To resolve issues stemming from these synchronization problems, follow this workflow:

  1. Incremental build
  # make some changes
  # commit
  cmake --build build
  cmake --build build -- ttrt
  # note you need to generate system_desc and flatbuffer again once you do this

This incremental build should be sufficient. If it does not resolve the error, please file an issue and proceed with the following steps for now.

  1. Clear the existing build and dependencies:
   rm -rf build third_party/tt-metal

This ensures that all previous build artifacts and dependencies are removed, preventing conflicts or stale files from affecting the new build.

  2. Rebuild from scratch: After clearing the build directories, rebuild the project from the ground up. This ensures that the build process incorporates all the necessary components without any remnants of previous builds. Build Instructions

  3. Switch build configurations: If switching from a Debug to a Release build (or vice versa), ensure that you clean the build environment before transitioning. This avoids inconsistencies between build configurations and potential issues with optimization levels or debugging symbols.

  4. Re-acquire the IRD: By relinquishing and re-acquiring the IRD, you ensure that the correct toolchain is used for the new build. This step ensures synchronization between the model and the toolchain.

  5. Enable Debug Logging for tt-metal: To gain more insight into potential issues, enable debugging by setting the TT_METAL_LOGGER_LEVEL to DEBUG. This will provide detailed logs, which can help in troubleshooting build or runtime issues.

   export TT_METAL_LOGGER_LEVEL=DEBUG

ttir-builder

ttir-builder is a tool for creating TTIR operations. It provides support for MLIR modules to be generated from user-constructed ops, lowered into TTNN or TTMetal backends, and finally translated into executable flatbuffers. Or you can do all three at once!

Getting started and building

Build ttmlir.

TTIRBuilder is a builder class providing the API for creating TTIR ops. The package ttir_builder contains everything needed to create ops for a TTIRBuilder object. ttir_builder.utils contains the APIs for wrapping op-creating functions into MLIR modules and flatbuffer files.

from ttir_builder import TTIRBuilder, Operand, Shape
from ttir_builder.utils import compile_to_flatbuffer

For the full set of supported ops, see tools/ttir-builder/builder.py. For more detailed information on available APIs, see tools/ttir-builder/builder.py and tools/ttir-builder/utils.py.
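
Before the step-by-step walkthrough below, here is a quick sketch of the "do all three at once" path via compile_to_flatbuffer. Only the first two arguments (the op-building function and the input shapes) are assumed here to mirror build_mlir_module; check tools/ttir-builder/utils.py for the exact signature and the keyword options (target backend, output paths, etc.).

from ttir_builder import Operand, TTIRBuilder
from ttir_builder.utils import compile_to_flatbuffer

shapes = [(32, 32), (32, 32)]

def model(in0: Operand, in1: Operand, builder: TTIRBuilder):
    return builder.multiply(in0, in1)

# Assumed invocation: build the TTIR module, run the backend pipeline, and
# translate the result to a flatbuffer in one call (verify the signature in
# utils.py before relying on this).
compile_to_flatbuffer(model, shapes)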

Creating a TTIR module

build_mlir_module defines an MLIR module specified as a python function. It wraps test_fn in an MLIR FuncOp, then wraps that in an MLIR module, and finally ties the arguments of that FuncOp to the test function's inputs. It will instantiate and pass a TTIRBuilder object as the last argument of test_fn.

def build_mlir_module(
    test_fn: Callable,
    inputs_shapes: List[Shape],
    inputs_types: Optional[List[Union[torch.dtype, TypeInfo]]] = None,
    mesh_shape: Optional[Tuple[int, int]] = None,
    module_dump: bool = False,
    base: Optional[str] = None,
    output_root: str = ".",
)

Example

from ttir_builder.utils import build_mlir_module
from ttir_builder import Operand, TTIRBuilder

shapes = [(32, 32), (32, 32), (32, 32)]

def model(in0: Operand, in1: Operand, in2: Operand, builder: TTIRBuilder):
    add_0 = builder.add(in0, in1)
    multiply_1 = builder.multiply(in1, add_0)
    return builder.multiply(multiply_1, in2)

module, builder = build_mlir_module(model, shapes)

Returns

An MLIR module containing an MLIR op graph defined by test_fn and the TTIRBuilder object used to create it

module {
  func.func @model(%arg0: tensor<32x32xf32>, %arg1: tensor<32x32xf32>, %arg2: tensor<32x32xf32>) -> tensor<32x32xf32> {
    %0 = ttir.empty() : tensor<32x32xf32>
    %1 = "ttir.add"(%arg0, %arg1, %0) : (tensor<32x32xf32>, tensor<32x32xf32>, tensor<32x32xf32>) -> tensor<32x32xf32>
    %2 = ttir.empty() : tensor<32x32xf32>
    %3 = "ttir.multiply"(%arg1, %1, %2) : (tensor<32x32xf32>, tensor<32x32xf32>, tensor<32x32xf32>) -> tensor<32x32xf32>
    %4 = ttir.empty() : tensor<32x32xf32>
    %5 = "ttir.multiply"(%3, %arg2, %4) : (tensor<32x32xf32>, tensor<32x32xf32>, tensor<32x32xf32>) -> tensor<32x32xf32>
    return %5 : tensor<32x32xf32>
  }
}

Running a pipeline

run_pipeline runs a pass on the TTIR module to lower it into a backend, using pipeline_fn. You can pass pipeline_fn in as one of the following: ttir_to_ttnn_backend_pipeline, ttir_to_ttmetal_backend_pipeline (both found in ttmlir.passes), or a custom pipeline built with create_custom_pipeline_fn. The default if none is provided is the TTNN pipeline.

def run_pipeline(
    module,
    pipeline_fn: Callable = ttir_to_ttnn_backend_pipeline,
    pipeline_options: List[str] = None,
    dump_to_file: bool = True,
    output_file_name: str = "test.mlir",
    system_desc_path: Optional[str] = None,
    mesh_shape: Optional[Tuple[int, int]] = None,
    argument_types_string: Optional[str] = None,
)

TTNN example

Let's expand on our previous example

from ttir_builder.utils import build_mlir_module, run_pipeline
from ttir_builder import Operand, TTIRBuilder
from ttmlir.passes import ttir_to_ttnn_backend_pipeline

shapes = [(32, 32), (32, 32), (32, 32)]

def model(in0: Operand, in1: Operand, in2: Operand, builder: TTIRBuilder):
    add_0 = builder.add(in0, in1)
    multiply_1 = builder.multiply(in1, add_0)
    return builder.multiply(multiply_1, in2)

module, builder = build_mlir_module(model, shapes)
ttnn_module = run_pipeline(module, ttir_to_ttnn_backend_pipeline)

Returns

An MLIR module lowered into TTNN

#dram = #ttnn.buffer_type<dram>
#system_desc = #tt.system_desc<[{role = host, target_triple = "x86_64-pc-linux"}], [{arch = <wormhole_b0>, grid = 8x8, coord_translation_offsets = 18x18, l1_size = 1499136, num_dram_channels = 12, dram_channel_size = 1073741824, noc_l1_address_align_bytes = 16, pcie_address_align_bytes = 32, noc_dram_address_align_bytes = 32, l1_unreserved_base = 97248, erisc_l1_unreserved_base = 69632, dram_unreserved_base = 32, dram_unreserved_end = 1073158336, physical_helper_cores = {dram = [ 0x0,  0x1,  0x2,  0x3,  0x4,  0x5,  0x6,  0x7,  0x8,  0x9,  0x10,  0x11] eth_inactive = [ 16x18,  16x19,  16x20,  16x21,  16x22,  16x23,  16x24,  16x25,  17x19,  17x20,  17x22,  17x23,  17x24]}, supported_data_types = [<f32>, <f16>, <bf16>, <bfp_f8>, <bfp_bf8>, <bfp_f4>, <bfp_bf4>, <bfp_f2>, <bfp_bf2>, <u32>, <u16>, <u8>, <si32>], supported_tile_sizes = [ 4x16,  16x16,  32x16,  4x32,  16x32,  32x32], num_cbs = 32, num_compute_threads = 1, num_datamovement_threads = 2}], [0], [3 : i32], [ 0x0x0x0]>
#ttnn_layout = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<1x1x!tt.tile<32x32, f32>, #dram>, <interleaved>>
module {
  tt.device_module {
    builtin.module attributes {tt.system_desc = #system_desc} {
      tt.device @default_device = <workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>, l1Map = (d0, d1, d2)[s0] -> (0, d0, d1, d2 + s0), dramMap = (d0, d1, d2)[s0, s1, s2, s3, s4, s5] -> (0, 0, (((d0 * s1) * (s2 * s3) + d1 * (s2 * s3) + d2) floordiv s4) mod 12, ((d0 * s1) * (s2 * s3) + d1 * (s2 * s3) + d2) floordiv (s4 * 12) + ((d0 * s1) * (s2 * s3) + d1 * (s2 * s3) + d2) mod s4 + s5), meshShape = , chipIds = [0]>
      func.func @model(%arg0: tensor<32x32xf32, #ttnn_layout>, %arg1: tensor<32x32xf32, #ttnn_layout>, %arg2: tensor<32x32xf32, #ttnn_layout>) -> tensor<32x32xf32, #ttnn_layout> {
        %0 = "ttnn.abs"(%arg0) : (tensor<32x32xf32, #ttnn_layout>) -> tensor<32x32xf32, #ttnn_layout>
        "ttnn.deallocate"(%arg0) <{force = false}> : (tensor<32x32xf32, #ttnn_layout>) -> ()
        %1 = "ttnn.multiply"(%arg1, %0) : (tensor<32x32xf32, #ttnn_layout>, tensor<32x32xf32, #ttnn_layout>) -> tensor<32x32xf32, #ttnn_layout>
        "ttnn.deallocate"(%0) <{force = false}> : (tensor<32x32xf32, #ttnn_layout>) -> ()
        "ttnn.deallocate"(%arg1) <{force = false}> : (tensor<32x32xf32, #ttnn_layout>) -> ()
        %2 = "ttnn.multiply"(%1, %arg2) : (tensor<32x32xf32, #ttnn_layout>, tensor<32x32xf32, #ttnn_layout>) -> tensor<32x32xf32, #ttnn_layout>
        "ttnn.deallocate"(%1) <{force = false}> : (tensor<32x32xf32, #ttnn_layout>) -> ()
        "ttnn.deallocate"(%arg2) <{force = false}> : (tensor<32x32xf32, #ttnn_layout>) -> ()
        return %2 : tensor<32x32xf32, #ttnn_layout>
      }
    }
  }
}

TTMetal example

Let's use the same code for TTMetal that was used in the TTNN example but change the pipeline_fn to ttir_to_ttmetal_backend_pipeline. Only one or the other can be run on a module since run_pipeline modifies the module in place. Note that while all TTIR ops supported by builder can be lowered to TTNN, not all can be lowered to TTMetal yet. Adding documentation to specify what ops can be lowered to TTMetal is in the works.

from ttmlir.passes import ttir_to_ttmetal_backend_pipeline
ttmetal_module = run_pipeline(module, ttir_to_ttmetal_backend_pipeline)

Returns

An MLIR module lowered into TTMetal

#l1 = #tt.memory_space<l1>
#system_desc = #tt.system_desc<[{role = host, target_triple = "x86_64-pc-linux-gnu"}], [{arch = <wormhole_b0>, grid = 8x8, coord_translation_offsets = 18x18, l1_size = 1499136, num_dram_channels = 12, dram_channel_size = 1073741824, noc_l1_address_align_bytes = 16, pcie_address_align_bytes = 32, noc_dram_address_align_bytes = 32, l1_unreserved_base = 1024, erisc_l1_unreserved_base = 1024, dram_unreserved_base = 1024, dram_unreserved_end = 1073741824, physical_helper_cores = {dram = [ 8x0,  9x0,  10x0,  8x1,  9x1,  10x1,  8x2,  9x2,  10x2,  8x3,  9x3,  10x3]}, supported_data_types = [<f32>, <f16>, <bf16>, <bfp_f8>, <bfp_bf8>, <bfp_f4>, <bfp_bf4>, <bfp_f2>, <bfp_bf2>, <u32>, <u16>, <u8>, <si32>], supported_tile_sizes = [ 4x16,  16x16,  32x16,  4x32,  16x32,  32x32], num_cbs = 32, num_compute_threads = 1, num_datamovement_threads = 2}], [0], [3 : i32], [ 0x0x0x0]>
module {
  tt.device_module {
    builtin.module attributes {tt.system_desc = #system_desc} {
      tt.device @default_device = <workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>, l1Map = (d0, d1, d2)[s0] -> (0, d0, d1, d2 + s0), dramMap = (d0, d1, d2)[s0, s1, s2, s3, s4, s5] -> (0, 0, (((d0 * s1) * (s2 * s3) + d1 * (s2 * s3) + d2) floordiv s4) mod 12, ((d0 * s1) * (s2 * s3) + d1 * (s2 * s3) + d2) floordiv (s4 * 12) + ((d0 * s1) * (s2 * s3) + d1 * (s2 * s3) + d2) mod s4 + s5), meshShape = , chipIds = [0]>
      func.func @model(%arg0: memref<32x32xf32>, %arg1: memref<32x32xf32>, %arg2: memref<32x32xf32>) -> memref<32x32xf32> {
        %0 = "ttmetal.create_buffer"() <{address = 9216 : i64}> : () -> memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>
        %1 = "ttmetal.create_buffer"() <{address = 1024 : i64}> : () -> memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>
        "ttmetal.enqueue_write_buffer"(%arg0, %1) : (memref<32x32xf32>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        "ttmetal.enqueue_program"(%1, %0, %1, %0) <{cb_ports = array<i64: 0, 1>, kernelConfigs = [#ttmetal.noc_config<@datamovement_kernel0, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, noc0>, #ttmetal.compute_config<@compute_kernel1, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, hifi4, false, false, [default]>], operandSegmentSizes = array<i32: 2, 2>}> : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%1) : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        %2 = "ttmetal.create_buffer"() <{address = 1024 : i64}> : () -> memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>
        %3 = "ttmetal.create_buffer"() <{address = 5120 : i64}> : () -> memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>
        "ttmetal.enqueue_write_buffer"(%arg1, %3) : (memref<32x32xf32>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        "ttmetal.enqueue_program"(%3, %2, %3, %2) <{cb_ports = array<i64: 0, 1>, kernelConfigs = [#ttmetal.noc_config<@datamovement_kernel2, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, noc0>, #ttmetal.compute_config<@compute_kernel3, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, hifi4, false, false, [default]>], operandSegmentSizes = array<i32: 2, 2>}> : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%3) : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        %4 = "ttmetal.create_buffer"() <{address = 13312 : i64}> : () -> memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>
        "ttmetal.enqueue_program"(%0, %2, %4, %0, %2, %4) <{cb_ports = array<i64: 0, 1, 2>, kernelConfigs = [#ttmetal.noc_config<@datamovement_kernel4, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>, <cb_port[2]>]>, noc0>, #ttmetal.noc_config<@datamovement_kernel5, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>, <cb_port[2]>]>, noc1>, #ttmetal.compute_config<@compute_kernel6, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>, <cb_port[2]>]>, hifi4, false, false, [default]>], operandSegmentSizes = array<i32: 3, 3>}> : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%0) : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%2) : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        %5 = "ttmetal.create_buffer"() <{address = 1024 : i64}> : () -> memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>
        %6 = "ttmetal.create_buffer"() <{address = 5120 : i64}> : () -> memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>
        "ttmetal.enqueue_write_buffer"(%arg1, %6) : (memref<32x32xf32>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        "ttmetal.enqueue_program"(%6, %5, %6, %5) <{cb_ports = array<i64: 0, 1>, kernelConfigs = [#ttmetal.noc_config<@datamovement_kernel7, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, noc0>, #ttmetal.compute_config<@compute_kernel8, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, hifi4, false, false, [default]>], operandSegmentSizes = array<i32: 2, 2>}> : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%6) : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        %7 = "ttmetal.create_buffer"() <{address = 17408 : i64}> : () -> memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>
        "ttmetal.enqueue_program"(%5, %4, %7, %5, %4, %7) <{cb_ports = array<i64: 0, 1, 2>, kernelConfigs = [#ttmetal.noc_config<@datamovement_kernel9, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>, <cb_port[2]>]>, noc0>, #ttmetal.noc_config<@datamovement_kernel10, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>, <cb_port[2]>]>, noc1>, #ttmetal.compute_config<@compute_kernel11, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>, <cb_port[2]>]>, hifi4, false, false, [default]>], operandSegmentSizes = array<i32: 3, 3>}> : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%5) : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%4) : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        %8 = "ttmetal.create_buffer"() <{address = 9216 : i64}> : () -> memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>
        %9 = "ttmetal.create_buffer"() <{address = 1024 : i64}> : () -> memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>
        "ttmetal.enqueue_write_buffer"(%arg2, %9) : (memref<32x32xf32>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        "ttmetal.enqueue_program"(%9, %8, %9, %8) <{cb_ports = array<i64: 0, 1>, kernelConfigs = [#ttmetal.noc_config<@datamovement_kernel12, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, noc0>, #ttmetal.compute_config<@compute_kernel13, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, hifi4, false, false, [default]>], operandSegmentSizes = array<i32: 2, 2>}> : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%9) : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        %10 = "ttmetal.create_buffer"() <{address = 5120 : i64}> : () -> memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>
        "ttmetal.enqueue_program"(%7, %8, %10, %7, %8, %10) <{cb_ports = array<i64: 0, 1, 2>, kernelConfigs = [#ttmetal.noc_config<@datamovement_kernel14, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>, <cb_port[2]>]>, noc0>, #ttmetal.noc_config<@datamovement_kernel15, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>, <cb_port[2]>]>, noc1>, #ttmetal.compute_config<@compute_kernel16, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>, <cb_port[2]>]>, hifi4, false, false, [default]>], operandSegmentSizes = array<i32: 3, 3>}> : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%8) : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%7) : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        %alloc = memref.alloc() : memref<32x32xf32>
        %11 = "ttmetal.create_buffer"() <{address = 1024 : i64}> : () -> memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>
        "ttmetal.enqueue_program"(%10, %11, %10, %11) <{cb_ports = array<i64: 0, 1>, kernelConfigs = [#ttmetal.noc_config<@datamovement_kernel17, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, noc0>, #ttmetal.compute_config<@compute_kernel18, #ttmetal.core_range<0x0, 1x1>, #ttmetal.kernel_args< ct_args = [<cb_port[0]>, <cb_port[1]>]>, hifi4, false, false, [default]>], operandSegmentSizes = array<i32: 2, 2>}> : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>, memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        "ttmetal.deallocate_buffer"(%10) : (memref<1x1x1x1x!tt.tile<32x32, f32>, #tt.shard<4096x4096>, #l1>) -> ()
        "ttmetal.enqueue_read_buffer"(%11, %alloc) : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>, memref<32x32xf32>) -> ()
        "ttmetal.finish"() : () -> ()
        "ttmetal.deallocate_buffer"(%11) : (memref<1x1x32x32xf32, #tt.shard<128x4>, #l1>) -> ()
        return %alloc : memref<32x32xf32>
      }
      func.func private @datamovement_kernel0() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @compute_kernel1() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<compute>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        %2 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "tilize_init"(%1, %0, %2) : (!emitc.opaque<"::tt::CB">, i32, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "experimental::tilize_block"(%1, %2, %0, %0) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, i32, i32) -> ()
        emitc.call_opaque "cb_push_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel2() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @compute_kernel3() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<compute>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        %2 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "tilize_init"(%1, %0, %2) : (!emitc.opaque<"::tt::CB">, i32, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "experimental::tilize_block"(%1, %2, %0, %0) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, i32, i32) -> ()
        emitc.call_opaque "cb_push_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel4() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>, <arg_type = cb_port, operand_index = 2>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel5() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>, <arg_type = cb_port, operand_index = 2>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @compute_kernel6() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>, <arg_type = cb_port, operand_index = 2>]>, ttkernel.thread = #ttkernel.thread<compute>} {
        %0 = "emitc.constant"() <{value = 0 : index}> : () -> !emitc.size_t
        %1 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        emitc.call_opaque "tile_regs_acquire"() : () -> ()
        %2 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        %3 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        %4 = emitc.literal "get_compile_time_arg_val(2)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%2, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%3, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "binary_op_init_common"(%2, %3, %4) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "add_tiles_init"(%2, %3) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "add_tiles"(%2, %3, %0, %0, %0) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, !emitc.size_t, !emitc.size_t, !emitc.size_t) -> ()
        emitc.call_opaque "tile_regs_commit"() : () -> ()
        emitc.call_opaque "tile_regs_wait"() : () -> ()
        emitc.call_opaque "pack_tile"(%0, %4, %0) {template_args = [true]} : (!emitc.size_t, !emitc.opaque<"::tt::CB">, !emitc.size_t) -> ()
        emitc.call_opaque "tile_regs_release"() : () -> ()
        emitc.call_opaque "cb_push_back"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%2, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%3, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel7() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @compute_kernel8() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<compute>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        %2 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "tilize_init"(%1, %0, %2) : (!emitc.opaque<"::tt::CB">, i32, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "experimental::tilize_block"(%1, %2, %0, %0) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, i32, i32) -> ()
        emitc.call_opaque "cb_push_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel9() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>, <arg_type = cb_port, operand_index = 2>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel10() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>, <arg_type = cb_port, operand_index = 2>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @compute_kernel11() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>, <arg_type = cb_port, operand_index = 2>]>, ttkernel.thread = #ttkernel.thread<compute>} {
        %0 = "emitc.constant"() <{value = 0 : index}> : () -> !emitc.size_t
        %1 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        emitc.call_opaque "tile_regs_acquire"() : () -> ()
        %2 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        %3 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        %4 = emitc.literal "get_compile_time_arg_val(2)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%2, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%3, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "binary_op_init_common"(%2, %3, %4) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "mul_tiles_init"(%2, %3) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "mul_tiles"(%2, %3, %0, %0, %0) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, !emitc.size_t, !emitc.size_t, !emitc.size_t) -> ()
        emitc.call_opaque "tile_regs_commit"() : () -> ()
        emitc.call_opaque "tile_regs_wait"() : () -> ()
        emitc.call_opaque "pack_tile"(%0, %4, %0) {template_args = [true]} : (!emitc.size_t, !emitc.opaque<"::tt::CB">, !emitc.size_t) -> ()
        emitc.call_opaque "tile_regs_release"() : () -> ()
        emitc.call_opaque "cb_push_back"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%2, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%3, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel12() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @compute_kernel13() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<compute>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        %2 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "tilize_init"(%1, %0, %2) : (!emitc.opaque<"::tt::CB">, i32, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "experimental::tilize_block"(%1, %2, %0, %0) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, i32, i32) -> ()
        emitc.call_opaque "cb_push_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel14() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>, <arg_type = cb_port, operand_index = 2>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel15() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>, <arg_type = cb_port, operand_index = 2>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @compute_kernel16() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>, <arg_type = cb_port, operand_index = 2>]>, ttkernel.thread = #ttkernel.thread<compute>} {
        %0 = "emitc.constant"() <{value = 0 : index}> : () -> !emitc.size_t
        %1 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        emitc.call_opaque "tile_regs_acquire"() : () -> ()
        %2 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        %3 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        %4 = emitc.literal "get_compile_time_arg_val(2)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%2, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%3, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "binary_op_init_common"(%2, %3, %4) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "mul_tiles_init"(%2, %3) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "mul_tiles"(%2, %3, %0, %0, %0) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, !emitc.size_t, !emitc.size_t, !emitc.size_t) -> ()
        emitc.call_opaque "tile_regs_commit"() : () -> ()
        emitc.call_opaque "tile_regs_wait"() : () -> ()
        emitc.call_opaque "pack_tile"(%0, %4, %0) {template_args = [true]} : (!emitc.size_t, !emitc.opaque<"::tt::CB">, !emitc.size_t) -> ()
        emitc.call_opaque "tile_regs_release"() : () -> ()
        emitc.call_opaque "cb_push_back"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%2, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%3, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%4, %1) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @datamovement_kernel17() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<noc>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_push_back"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
      func.func private @compute_kernel18() attributes {ttkernel.arg_spec = #ttkernel.arg_spec< ct_args = [<arg_type = cb_port, operand_index = 0>, <arg_type = cb_port, operand_index = 1>]>, ttkernel.thread = #ttkernel.thread<compute>} {
        %0 = "emitc.constant"() <{value = 1 : i32}> : () -> i32
        %1 = emitc.literal "get_compile_time_arg_val(0)" : !emitc.opaque<"::tt::CB">
        %2 = emitc.literal "get_compile_time_arg_val(1)" : !emitc.opaque<"::tt::CB">
        emitc.call_opaque "cb_reserve_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "untilize_init"(%1, %2) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">) -> ()
        emitc.call_opaque "experimental::untilize_block"(%1, %2, %0, %0) : (!emitc.opaque<"::tt::CB">, !emitc.opaque<"::tt::CB">, i32, i32) -> ()
        emitc.call_opaque "cb_push_back"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_wait_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%1, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        emitc.call_opaque "cb_pop_front"(%2, %0) : (!emitc.opaque<"::tt::CB">, i32) -> ()
        return
      }
    }
  }
}

Compiling into flatbuffer

compile_to_flatbuffer compiles a TTIRBuilder function fn straight to flatbuffer. It is mainly a wrapper around the following functions, with each one called on the output of the previous: build_mlir_module, run_pipeline, and ttnn_to_flatbuffer_file or ttmetal_to_flatbuffer_file, as dictated by the target parameter.

def compile_to_flatbuffer(
    fn: Callable,
    inputs_shapes: List[Shape],
    inputs_types: Optional[List[Union[torch.dtype, TypeInfo]]] = None,
    system_desc_path: str = "ttrt-artifacts/system_desc.ttsys",
    test_base: str = "test",
    output_root: str = ".",
    target: Literal["ttnn", "ttmetal"] = "ttnn",
    mesh_shape: Optional[Tuple[int, int]] = None,
    module_dump: bool = True,
    argument_types_string: Optional[str] = None,
    custom_pipeline: Union[Callable, str] = None,
    pipeline_options: List[str] = None,
)

No flatbuffer is printed or returned; it is only written to a file, since the flatbuffer is binary data rather than printable text.

TTNN example

Let's use our previous model function.

from ttir_builder.utils import compile_to_flatbuffer
from ttir_builder import Operand, TTIRBuilder

shapes = [(32, 32), (32, 32), (32, 32)]

def model(in0: Operand, in1: Operand, in2: Operand, builder: TTIRBuilder):
    add_0 = builder.add(in0, in1)
    multiply_1 = builder.multiply(in1, add_0)
    return builder.multiply(multiply_1, in2)

compile_to_flatbuffer(
    model,
    shapes,
    target="ttnn",
)

TTMetal example

Let's once again use the same code for TTMetal that was used in the TTNN example but change the target to "ttmetal". Just as with run_pipeline, only one or the other can be run on a module since compile_to_flatbuffer modifies the module in place.

compile_to_flatbuffer(
    model,
    shapes,
    target="ttmetal",
)

Integrating with other tools

Alternatives for file creation

  1. The ttmlir-opt tool runs a compiler pass on an .mlir file.
  2. The ttmlir-translate tool can generate a flatbuffer from an .mlir file.
  3. llvm-lit can also be used to generate a flatbuffer from an existing .mlir file.

Running models

ttrt

ttrt is intended to be a swiss army knife for working with flatbuffers.

tt-explorer

tt-explorer is a visualizer tool for ttmlir-powered compiler results.

ttnn-standalone

ttnn-standalone is a post-compile tuning/debugging tool.

llvm-lit

llvm-lit can also be used for MLIR testing.

Golden mode

Golden dataclass

TTIRBuilder provides support for encoding golden tensors into flatbuffers, which are used for comparison with TT device output in the ttrt runtime. Golden is the dataclass used to store information about a golden tensor. Each TTIR op should have a matching PyTorch op (or a golden function built from PyTorch ops) that performs exactly the same operation, generating the same outputs given the same inputs. You can use TTIRBuilder helper functions to store input, intermediate, and output tensors within the flatbuffer. Input and output goldens are keyed as "input_" and "output_" followed by a tensor index, for example input_0. Intermediate tensors can be stored through the same helper functions.
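
As a purely illustrative sketch of what "matching" means, the golden for a ttir.add op would be computed with the corresponding PyTorch op on the same input values (the tensors here are placeholders; the builder generates its own inputs internally):

import torch

in0 = torch.randn(32, 32)
in1 = torch.randn(32, 32)

# The golden for a ttir.add op is the matching PyTorch op applied to the same inputs.
golden_add = torch.add(in0, in1)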

GoldenCheckLevel Enum

TTIRBuilder stores an instance of the class GoldenCheckLevel(Enum) that dictates golden handling. It defaults to GoldenCheckLevel.OP_LEVEL. The exception is that TTIRBuilder CCL ops force the golden check level to GRAPH_LEVEL.

DISABLED : do not store goldens
OP_LEVEL : check goldens for every op
GRAPH_LEVEL : check graph-level goldens only

Check and set GoldenCheckLevel with TTIRBuilder APIs.

from ttir_builder import TTIRBuilder, Operand, GoldenCheckLevel

def model(in0: Operand, in1: Operand, in2: Operand, builder: TTIRBuilder):
    add_0 = builder.add(in0, in1)
    multiply_1 = builder.multiply(in1, add_0)
    builder.golden_check_level = GoldenCheckLevel.GRAPH_LEVEL
    return builder.multiply(multiply_1, in2)

Getting golden data

Unless otherwise specified in the GoldenCheckLevel, all input and output tensors will generate and store a golden in TTIRBuilder as a Golden type. The TTIRBuilder class has an API to print stored goldens if you want access to the data they contain: print_goldens(self).
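
For example, a minimal call (assuming builder is the TTIRBuilder instance returned by build_mlir_module in the earlier examples):

builder.print_goldens()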

Golden tensor:
tensor([[ 4.0450e+00,  1.4274e+00,  5.9156e-01,  ..., -5.9834e-01,
         -1.1830e-01,  1.2837e-01],
        [ 2.3788e+00,  2.9242e-03, -5.2838e-02,  ...,  1.8294e+00,
          5.0348e+00,  9.7179e-01],
        [ 1.5168e-02,  1.0577e-01, -3.0682e-01,  ...,  6.7212e-01,
          9.4523e-02,  5.3765e+00],
        ...,
        [ 1.4241e-01,  1.1838e+00, -1.0601e+00,  ...,  4.9099e-01,
          4.2267e+00,  4.0610e-01],
        [ 5.6630e-01, -1.3068e-01, -1.7771e-01,  ...,  2.3862e+00,
          3.9376e-01,  7.3140e-01],
        [ 4.2420e+00,  1.7006e-01, -3.4861e-01,  ...,  1.1471e-01,
          1.6189e+00, -6.9106e-01]])

The TTIRBuilder API get_golden_map(self) is used to export golden data for flatbuffer construction. It returns a dictionary of golden tensor names and GoldenTensor objects. Printing that map will look something like this:

{'input_0': <ttmlir._mlir_libs._ttmlir.passes.GoldenTensor object at 0x7f77c70fa0d0>, 'input_1': <ttmlir._mlir_libs._ttmlir.passes.GoldenTensor object at 0x7f77c70fa160>, 'input_2': <ttmlir._mlir_libs._ttmlir.passes.GoldenTensor object at 0x7f77c6fc9500>, 'output_0': <ttmlir._mlir_libs._ttmlir.passes.GoldenTensor object at 0x7f77c6fc9590>}

To get info from a GoldenTensor object, use the attributes supported by ttmlir.passes: name, shape, strides, dtype, data.

from ttmlir.passes import GoldenTensor
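
A short sketch of reading those attributes from the exported map (assuming builder is the TTIRBuilder from the earlier examples):

for name, golden in builder.get_golden_map().items():
    # Each value is a GoldenTensor exposing name, shape, strides, dtype, and data.
    print(name, golden.shape, golden.strides, golden.dtype)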

Setting golden data

Use TTIRBuilder API set_graph_input_output to set your own input and output golden tensors using PyTorch tensors.

set_graph_input_output(
        self,
        inputs: List[torch.Tensor],
        outputs: Optional[List[torch.Tensor]] = None,
        override: bool = False,
    )
import torch

input_0 = torch.ones((32, 32))
output_0 = torch.zeros((32, 32))
builder.set_graph_input_output([input_0], [output_0], True)

Running flatbuffer with golden data in ttrt

Running flatbuffers in ttrt requires additional building and setting up the environment. Run these commands before creating MLIR modules or flatbuffers so the system description in the flatbuffers matches your device.

cmake --build build -- ttrt
ttrt query --save-artifacts
export SYSTEM_DESC_PATH=$(pwd)/ttrt-artifacts/system_desc.ttsys

Set environment variable TTRT_LOGGER_LEVEL to DEBUG so ttrt logs golden comparison results and prints graph level golden tensors.

export TTRT_LOGGER_LEVEL=DEBUG

Finally run ttrt. Our example flatbuffer file (since we didn't specify otherwise) defaulted to file path ./ttnn/test_ttnn.mlir.ttnn. --log-file ttrt.log and --save-golden-tensors are both optional flags. They ensure that all golden data produced by the ttrt run gets written to files.

ttrt run ttnn/test_ttnn.mlir --log-file ttrt.log --save-golden-tensors

Golden callbacks

The ttrt documentation contains a section on the callback function feature. Callback functions run between each op execution during runtime and contain op-level golden analysis. They are also customizable and provide the flexibility for you to get creative with your golden usage.

Adding a new op to ttir-builder

ttir-builder is designed to only create ops supported in TTIR. At the moment, most but not all ops are supported, and new ops are still occasionally added to TTIR. Creating ttir-builder support for an op entails writing a function in tools/ttir-builder/builder.py that will create the op and its golden counterpart.

TTIR op factories

All ops are created by running their relevant information through the op_proxy function, which provides a general interface for proxying and creating ops.

def op_proxy(
    self,
    op_golden_function: Callable,
    op_ttir_function: Callable,
    inputs: List[Operand],
    unit_attrs: List[str] = None,
    organize_ttir_args: Optional[Callable] = None,
    organize_golden_args: Optional[Callable] = None,
    output_shape: Optional[Shape] = None,
    output_type: Optional[Type] = None,
    output_create_fn: Optional[Callable] = None,
    golden_kwargs: dict = {},
    ttir_kwargs: dict = {},
)

Eltwise ops require less specialized handling and call op_proxy through eltwise_proxy.

def eltwise_proxy(
    self,
    op_golden_function: Callable,
    op_ttir_function: Callable,
    inputs: List[Operand],
    unit_attrs: List[str] = None,
)
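
As a rough, hedged sketch, a new elementwise op wrapper added to TTIRBuilder in builder.py might look something like this (the op name exp, the torch.exp golden, and the ttir.ExpOp binding are illustrative assumptions, not a description of the actual implementation):

def exp(self, in0: Operand, unit_attrs: List[str] = None) -> Operand:
    # torch.exp is used as the golden function; ttir.ExpOp is assumed to be the
    # matching TTIR op class exposed by the ttmlir Python bindings.
    return self.eltwise_proxy(torch.exp, ttir.ExpOp, [in0], unit_attrs)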

CCL ops require GoldenCheckLevel to be set to GRAPH_LEVEL and integrate that into their own proxy function.

def ccl_proxy(
    self,
    op_golden_function: Callable,
    op_ttir_function: Callable,
    inputs: List[Operand],
    kwargs: dict = {},
)

Golden functions

Setting the various inputs, outputs, arguments, shapes, and types is all fairly straightforward. Find the TTIR op in include/ttmlir/Dialect/TTIR/IR/TTIROps.td and replicate the pertinent details. If necessary information is not included there, you may have to do some detective work and trial and error. The tricky part can be finding or writing a golden function. It must perform exactly the same operation as the TTIR op and be written using PyTorch operations.
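
When no single PyTorch op matches, the golden can be composed from several PyTorch operations. A minimal, purely illustrative sketch (the op and formula here are hypothetical):

import torch

def cbrt_golden(x: torch.Tensor) -> torch.Tensor:
    # Sign-preserving cube root composed from PyTorch primitives.
    return torch.sign(x) * torch.pow(torch.abs(x), 1.0 / 3.0)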

tt-explorer

Welcome to the tt-explorer wiki! The Wiki will serve as a source for documentation, examples, and general knowledge related to the TT-MLIR visualization project. The sidebar will provide navigation to relevant pages. If this is your first time hearing about the project, take a look at Project Architecture for an in-depth introduction to the tool and motivations behind it :)

Quick Start

TT-Explorer comes packaged as a tool in the tt-mlir repo.

  1. Run source env/activate to be in tt-mlir virtualenv for the following steps
  2. Ensure tt-mlir is built with at least these flags:
    • -DTT_RUNTIME_ENABLE_PERF_TRACE=ON -DTTMLIR_ENABLE_RUNTIME=ON -DTT_RUNTIME_DEBUG=ON
  3. Build explorer target in tt-mlir using cmake --build build -- explorer
  4. Run tt-explorer in the terminal to start a tt-explorer instance. (Refer to the CLI section in API for specifics)
  5. Ensure the server has started in the tt-explorer shell instance (check for the message below)
Starting Model Explorer server at:
http://localhost:8080

Running TT-Explorer Tests Locally

TT-Explorer relies on tests that are present in the tests/ directory as well as tests dynamically created through llvm-lit. Below are the steps to replicate the testing procedure seen in CI:

  1. Make sure you're in the tt-mlir directory
  2. You need to build the explorer target with cmake --build build -- explorer
  3. Run and save the system descriptor ttrt query --save-artifacts
  4. Save the system variable export SYSTEM_DESC_PATH=$(pwd)/ttrt-artifacts/system_desc.ttsys
  5. Run and generate ttnn + MLIR tests: cmake --build build -- check-ttmlir
  6. Save the relevant test directories:
    • export TT_EXPLORER_GENERATED_MLIR_TEST_DIRS=$(pwd)/build/test/python/golden/ttnn,$(pwd)/build/test/ttmlir/Silicon/TTNN/n150/perf
    • export TT_EXPLORER_GENERATED_TTNN_TEST_DIRS=$(pwd)/build/test/python/golden/ttnn
  7. Run the pytest for tt-explorer with pytest tools/explorer/test/run_tests.py

or in a concise shell script:

# Ensure you are present in the tt-mlir directory
source env/activate

# Build Tests
cmake --build build -- explorer
ttrt query --save-artifacts
export SYSTEM_DESC_PATH=$(pwd)/ttrt-artifacts/system_desc.ttsys
cmake --build build -- check-ttmlir

# Load Tests
export TT_EXPLORER_GENERATED_MLIR_TEST_DIRS=$(pwd)/build/test/python/golden/ttnn,$(pwd)/build/test/ttmlir/Silicon/TTNN/n150/perf
export TT_EXPLORER_GENERATED_TTNN_TEST_DIRS=$(pwd)/build/test/python/golden/ttnn

# Run Tests
pytest tools/explorer/test/run_tests.py

TT-Explorer is a visualizer tool for ttmlir-powered compiler results. It visualizes emitted .mlir files to display the compiled model, attributes, and performance results, and provides a platform for human-driven overrides to gamify model tuning.

TT-Explorer - Project Architecture

TT-Explorer is a tool made to ease the pain of tuning a model and developing on Tenstorrent hardware. It provides a “Human-In-Loop” interface such that the compiler results can be actively tuned and understood by the person compiling the model. To complete this goal, the tool has to be designed such that users of any level of experience are all able to glean useful information from the visualization of the model, and be able to explore what the model does.

Software Architecture

The software will be built around the TT-Forge compiler to provide most of the functionality. Model Explorer will be used for the visualization functionality and as the main platform upon which TT-Explorer is built.

Since Model-Explorer is built using Python, the majority of TT-Explorer will be structured in Python, with frequent use of the bindings to C++ provided by TT-MLIR.

The following components will be put together:

[Diagram: TT-Explorer architecture]

TT-Forge-FE (Front End)

TT-Forge FE is currently the primary frontend; it uses TVM to transform conventional AI models into MLIR in the TTIR dialect.

Ingests: AI Model defined in PyTorch, TF, etc… Emits: Rudimentary TTIR Module consisting of Ops from AI Model.

TT-MLIR

TT-MLIR is the out-of-tree MLIR compiler created by Tenstorrent to specifically target TT hardware as a backend. It comprises a platform of several dialects (TTIR, TTNN, TTMetal) and the passes and transformations to compile a model into an executable that can run on TT hardware. In the scope of TT-Explorer, the Python bindings will be leveraged.

Ingests: TTIR Module, Overrides JSON Emits: Python Bindings to interface with TTIR Module, Overridden TTIR Modules, Flatbuffers

TT-Adapter

Model Explorer provides an extension interface where custom adapters can be implemented to visualize different formats. TT-Adapter is the adapter created for TT-Explorer that parses TTIR modules using the Python bindings provided by TT-MLIR to create a graph legible by model-explorer. It also has an extensible REST endpoint that is leveraged to implement functionality; this endpoint acts as the main bridge between the client- and host-side processes.

Ingests: TTIR Modules, TT-MLIR Python Bindings, REST API Calls Emits: Model-Explorer Graph, REST API Results

TTRT

TT-RT is the runtime library for TT-Forge, which provides an API to run Flatbuffers generated from TT-MLIR. These flatbuffers contain the compiled results of the TTIR module, and TTRT allows us to query and execute them. Particularly, a performance trace can be generated using Tracy, which is fed into model-explorer to visualize the performance of operations in the graph.

Ingests: Flatbuffers Emits: Performance Trace, Model Results

Model-Explorer

Model Explorer is the backbone of the client and visualization of these models. It is deceptively placed in the "Client" portion of the diagram, but realistically TT-Explorer will be run on the host, and so will the model-explorer instance. The frontend will be a client of the REST API created by TT-Adapter and will use URLs from the model-explorer server to visualize the models. Currently TT maintains a fork of model-explorer which has overridden UI elements for overrides and displaying performance traces.

Ingests: Model Explorer Graph, User-Provided Overrides (UI), Performance Trace Emits: Overrides JSON, Model Visualization

These components all work together to provide the TT-Explorer platform.

Client-Host Design Paradigm

Since performance traces and execution rely on silicon machines, there is a push to decouple the execution and MLIR-environment-heavy aspects of TT-Explorer onto some host device and have a lightweight client API that uses the REST endpoint provided by TT-Adapter to leverage the host device without having to constantly be on said host. This is very useful for cloud development (as is common at Tenstorrent). In doing so, TT-Explorer is a project that can be spun up either in a tt-mlir environment or without one. The lightweight Python version of TT-Explorer provides a set of utilities that call upon and visualize models from the host; the host creates the server and serves the API to be consumed.

TT-Explorer

This section provides details about the usage of TT-Explorer.

Input Models

Currently TT-Explorer supports 3 types of models that can be executed/visualized.

Input Type                               | Execution Support | Visualization Support
.ttnn Flatbuffers with Debug Info        | ✔️                | ✔️
.ttnn Flatbuffers without Debug Info     |                   |
.mlir TTIR Modules                       | ✔️                | ✔️
.mlir TTNN Modules                       |                   | ✔️

CLI

The CLI for tt-explorer provides a simple suite of options to start the UI:

tt-explorer -p <port> -u <url> -q

Options:

  • -p, --port: Port that model-explorer server will be exposed to. Default is 8080.
  • -u, --url: Host URL Address for server. Default is "localhost".
  • -q, --no-browser: Create server without opening a browser tab.

Example usage:

tt-explorer -p 8000 -u 0.0.0.0 -q

This command will start the TT-Explorer server on port 8000, accessible at the address 0.0.0.0, and without opening a browser tab.

UI

For general reference of the UI, refer to the model-explorer wiki. This section will highlight specific UI elements added to the Tenstorrent fork of model-explorer.

Model Execution

In the top right of the screen an additional element has been added to the top bar. It features the UI elements that invoke the execution functionality. Once the model has executed, overlays are also created. These overlays provide information on how the execution went.

Performance Overlay

The performance overlay is generated on every execution; it highlights the time it took to execute each node on the graph. This is visualized with a gradient from Yellow -> Red, with Yellow being the lowest time amongst all nodes on the graph and Red being the highest.

Accuracy Overlay

The accuracy overlay is only generated when executing from a compatible flatbuffer (a .ttnn file with Debug Info). The overlay consists of either Green or Red node overlays: Green if the node passed a "golden" test, Red if not. The value for the overlay is the actual Pearson Correlation Coefficient (PCC) against the "golden" tensor minus the expected PCC value. If the number is < 0, the node does not match the expected PCC; otherwise the comparison is accurate.
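
A minimal sketch of that sign convention (the variable names are illustrative):

overlay_value = actual_pcc - expected_pcc  # actual PCC against the golden tensor
golden_check_passed = overlay_value >= 0   # negative means the expected PCC was not met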

Advanced Settings

This menu will open a window with some advanced settings for Model execution.

Opt. Policy

This dropdown provides a list of Optimization Policies which will be used when the model is executed. These policies are applied when lowering from a ttir module to an executable ttnn module.

Generate C++ Code

This toggle will run the EmitC pass in the tt-mlir compiler to generate TTNN C++ Code and make it available to you after running a model. Default value for this toggle is Off.

"Play" Button

This button invokes the execute function which will compile and execute the model. The button will then be "loading" until execution is finished. Once execution is finished, a performance trace should be overlaid on the graph and the graph should reload.

"Code" Button

If the Generate C++ Code flag is set, this button will become available to view and download the C++ code in a window within explorer.

"Comment" Button

This button will open a window to view the shell logs while execution is running. If any errors occur they will be displayed here.

Overridden Fields

Certain nodes on the graph will have attributes that are presented as a dropdown. These are fields which have overrides available. These values can be changed and then sent to be recompiled; invalid configurations will result in errors.

TT-Adapter

The following is a reference for the REST API provided by TT-Adapter. First, a short info-dump on how an extensible API can be built on top of Model Explorer.

Building an API using Model Explorer

The /apipost/v1/send_command endpoint provides an extensible platform with which commands are sent to be executed directly by the adapter specified. This becomes the main endpoint through which communication is facilitated between the server and client; the commands respond with an "adapter response".

Sending Commands

The body of the command must be JSON, and conform to the following interface (described below as a Typescript interface). Specific commands may narrow the field types or extend this interface.

interface ExtensionCommand {
  cmdId: string;
  extensionId: string;
  modelPath: string;
  settings: Record<string, any>;
  deleteAfterConversion: boolean;
}

More often than not, functions do not need all of these fields, but they must all be present to properly process the command sent into the handling function on the server.

Speaking of functions, the signature that all functions handling commands on the server have to follow is as such:

class TTAdapter(Adapter):
  # ...
  def my_adapter_fn(self, model_path: str, settings: dict):
    # Parse model_path and settings objects as they are fed from send_command endpoint.
    pass

This function is invoked on a new instance every time. This is important to understand for the idea of persisting information on the server. As all requests to the server are stateless, the onus is often on the end-user to store and preserve important information such as the path of a model they've uploaded, or the paths of important artifacts that the server has produced. TTExplorer aims to make this as easy as possible.

Information can be processed in this function however the user would like, and settings often becomes a versatile field for providing more information and context for the execution of some function. As an example, refer to TTAdapter:initialize: this function, which loads a SystemDesc into the environment, has little to do with modelPath or deleteAfterConversion, so these variables are not processed at all, and the function only executes a static initialization process regardless of the parameters passed into the command.

Example request

Below is an example of the JSON request sent from the UI to the server:

{
  // tt_adapter to invoke functions from TT-Adapter
  "extensionId": "tt_adapter",
  // Name of function to be run, "convert" is built into all adapters to convert some model to graph
  "cmdId": "convert",
  // Path to model on server to be fed into function
  "modelPath": "/tmp/tmp80eg73we/mnist_sharding.mlir",
  // Object holding custom settings to be fed into function
  "settings": {
    "const_element_count_limit": 16,
    "edge_label_font_size": 7.5,
    "artificial_layer_node_count_threshold": 1000,
    "keep_layers_with_a_single_child": false,
    "show_welcome_card": false,
    "disallow_vertical_edge_labels": false,
    "show_op_node_out_of_layer_edges_without_selecting": false,
    "highlight_layer_node_inputs_outputs": false,
    "hide_empty_node_data_entries": false
  },
  // `true` if file at `modelPath` is to be deleted after function run
  "deleteAfterConversion": true
}

Adapter Response

Model Explorer was probably not made to allow for such an extensible framework to be tacked onto it. As such, the adapter response is processed in a very particular way before it is sent back to the user. In particular, refer to model_explorer.utils.convert_adapter_response which is run on the output of every function.

This means that, for compatibility reasons (i.e. to not stray too far from the upstream implementation we are based on), responses sent from the server must be JSON only and must wrap their data in a graphs property.

Below is the base typescript interface that the UI expects for the json response. Commands can define custom data inside the graph property.

/** A response received from the extension. */
interface ExtensionResponse<
  G extends Array<unknown> = Graph[],
  E extends unknown = string
> {
  graphs: G;
  error?: E;
}

The same applies to custom adapter responses. This limits the transfer of raw bytes through different MIME types, and requires tt_adapter.utils.to_adapter_format, which turns any dict object into a Model Explorer adapter-compatible response. While this framework works well for graphs, it makes an "extensible" API difficult to implement.
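
As a rough sketch of how a handler might use this helper: only to_adapter_format and the TTAdapter/Adapter class names are taken from this document, while the function name and payload are illustrative.

# Illustrative sketch only: a handler that returns a plain dict wrapped with
# the to_adapter_format helper mentioned above so Model Explorer accepts it.
from tt_adapter import utils

class TTAdapter(Adapter):
  # ...
  def my_status_fn(self, model_path: str, settings: dict):
    result = {"isDone": False, "progress": 20}
    return utils.to_adapter_format(result)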

Current API Reference:

Convert

Standard built-in conversion function, converts TTIR Module into Model Explorer Graph. Also provides settings as a platform for overrides to be applied to the graph.

Request

// As this is the base request everything is based off,
// this interface only narrows down the command to be "convert".
interface AdapterConvertCommand extends ExtensionCommand {
  cmdId: 'convert';
}

Response

// As this is the base response everything is based off,
// it is exactly the same as `ExtensionResponse`.
type AdapterConvertResponse = ExtensionResponse;
{
  "graphs": [{
    // Model Explorer Graph JSON Object
  }]
}

Initialize

Called from TTExplorer.initialize, used to load a SystemDesc into the environment.

Request

interface InitializeCommand extends ExtensionCommand {
  cmdId: 'initialize';
}

Response

type AdapterInitializeResponse = ExtensionResponse<[{
  system_desc_path: string
}]>;
{
  "graphs": [{
    "system_desc_path": "<path to system_desc.ttsys>"
  }]
}

Execute

Called from TTExplorer.execute_model, executes a model.

Request

interface AdapterExecuteCommand extends ExtensionCommand {
  cmdId: 'execute';
}

Response

// When the request is successful, we don't expect any response back.
// Thus, an empty array is returned for `graphs`.
type AdapterExecuteResponse = ExtensionResponse<[]>;
{
  "graphs": []
}

Status Check

Called from ..., it is used to check the execution status of a model and update the UI accordingly.

Request

interface AdapterStatusCheckCommand extends ExtensionCommand {
  cmdId: 'status_check';
}

Response

type AdapterStatusCheckResponse = ExtensionResponse<[{
  isDone: boolean;
  progress: number;
  total?: number;
  timeElapsed?: number;
  currentStatus?: string;
  error?: string;
  stdout?: string;
  log_file?: string;
}]>;
{
  "graphs": [{
    "isDone": false,
    "progress": 20,
    "total": 100,
    "timeElapsed": 234,
    "stdout": "Executing model...\nPath: /path/to/model",
    "log_file": "/path/to/log/on/the/server"
  }]
}

Override

Called from ... to send overrides made through the UI to the server for processing.

Request

interface KeyValue {
  key: string;
  value: string;
}

interface AdapterOverrideCommand extends ExtensionCommand {
  cmdId: 'override';
  settings: {
    graphs: Graph[];
    overrides: Record<string, {
      named_location: string,
      attributes: KeyValue[]
    }>;
  };
}

Response

type AdapterOverrideResponse = ExtensionResponse<[{
  success: boolean;
}]>;
{
  "graphs": [{
    "success": true
  }]
}

Editable attributes

To enable an attribute to be edited, a response coming from the server should contain the editable field on the attribute.

The typescript interface is as follows:

interface Graph {
	nodes: GraphNode[];
	// ...
}

interface GraphNode {
	attrs?: Attribute[];
	// ...
}

type EditableAttributeTypes = EditableIntAttribute | EditableValueListAttribute | EditableGridAttribute; // Attribute types are defined below...

interface Attribute {
	key: string;
	value: string;
	editable?: EditableAttributeTypes; // <- the editable attribute information
}

EditableIntAttribute

This editable attribute represents a list of integer values. It expects the attribute value to be formatted as a string, starting with [ and ending with ], with all values separated by ,, as in the example below:

[1, 2, 3]

The typescript interface for the editable attribute is this:

interface EditableIntAttribute {
	input_type: 'int_list';
	min_value?: number = 0;
	max_value?: number = 100;
	step?: number = 1;
}

Both min_value and max_value define the accepted range of values, and step defines the amount to increment or decrement per step.

The default range of values is between 0 and 100, inclusive, and the default step is 1. Thus by default, the value will increment or decrement by 1 each time to a minimum of 0 and a maximum of 100.

Here is an example of what this attribute looks like:

{
  "graphs": [{
    "nodes": [
	    {
		    "attrs": [
			    {
				    "key": "shape",
				    "value": "[8, 8]",
				    "editable": {
					    "input_type": "int_list",
					    "min_value": 8,
					    "max_value": 64,
					    "step": 8
				    }
			    }
		    ]
	    }
    ]
  }]
}

EditableValueListAttribute

This editable attribute defines a fixed list of string values to display.

The typescript interface for the editable attribute is this:

interface EditableValueListAttribute {
	input_type: 'value_list';
	options: string[];
}

The options property provides the list of options to be displayed. The current value will be added to this list and any duplicates will be removed.

Here is an example of what this attribute looks like:

{
  "graphs": [{
    "nodes": [
	    {
		    "attrs": [
			    {
				    "key": "chip_arch",
				    "value": "wormhole",
				    "editable": {
					    "input_type": "value_list",
					    "options": [
						    "wormhole",
						    "grayskull"
					    ]
				    }
			    }
		    ]
	    }
    ]
  }]
}

EditableGridAttribute

The grid attribute is similar to the integer list, with the main differences that you can specify the separator on which the list will be split, and that the value doesn't need to be enclosed in brackets ([ and ]). The data for a grid attribute looks like this:

4x4x2

The typescript interface for the editable attribute is this:

interface EditableGridAttribute {
	input_type: 'grid';
	separator?: string = 'x';
	min_value?: number = 0;
	max_value?: number = 100;
	step?: number = 1;
}

Both min_value and max_value define the accepted range of values, and step defines the amount to increment or decrement per step.

The default range of values is between 0 and 100, inclusive, and the default step is 1. Thus by default, the value will increment or decrement by 1 each time to a minimum of 0 and a maximum of 100.

The separator attribute defines the character used to split the string; it defaults to "x".

Here is an example of what this attribute looks like:

{
  "graphs": [{
    "nodes": [
	    {
		    "attrs": [
			    {
				    "key": "grid",
				    "value": "4x4",
				    "editable": {
					    "input_type": "grid",
					    "min_value": 4,
					    "max_value": 64,
					    "step": 4,
					    "separator": "x"
				    }
			    }
		    ]
	    }
    ]
  }]
}

Milestone 1 (v0.1)

Main Goal - Visualize & Execute

This will highlight half of the essential work that this tool should be able to do in both visualizing a model and executing it using the current TT-Forge stack. The frontend transformation of a model -> TTIR will be done outside of the scope of TT-Explorer at the moment. For this milestone TT-Explorer will be able to spin up a host-side and a client-side instance. The tool will be able to ingest TTIR modules to produce a visual result, and be able to execute this module. Ambitiously, the performance traces should be collected back into TT-Explorer to be displayed.

Tasks:

  • Load TTIR Modules and Visualize TTIR-Ops in Model Explorer
  • Create Extensible Notebook UX allowing for visualization and scripting capabilities
  • Add functionality to Model Explorer to load from re-compiled TTIR Modules (might be from JSON)
  • Add functionality to TT-MLIR to execute from Python Bindings
  • Create REST API skeleton in TT-Adapter
  • From REST API Call, Invoke python bindings to execute TTIR module using TT-Adapter
  • (If possible) Parse Perf Trace Artifact and visualize performance in Model-Explorer (as Node Data)

Milestone 2 (v0.2)

Main Goal - Model Editor

The primary function of TT-Explorer is to visualize and edit a model according to overrides the user defines on top of the automatically generated compiler results. This milestone highlights that functionality in TT-Explorer, focusing on providing the UI, TT-MLIR, and TT-Explorer features that enable the user to edit and tune a model “in-loop” with the TT-Forge compiler.

Tasks:

  • Flesh out and test location IDs such that operations can be tracked through the compiler stack.
  • Use Loc IDs to bind TTIR Ops with Tracy Perf Trace Artifact, and send to Model-Explorer to visualize.
  • Implement Overrides Functionality into TT-MLIR, tracking based on Loc IDs.
  • Overhaul UI to enable editing node attributes, use these updated fields to send information back to TT-Explorer via REST API (in the form of an Overrides JSON)
  • Parse the Overrides JSON and apply the overrides over a REST API call, then visualize the re-compiled graph.
  • Provide a REST API endpoint that supplies the “legal” options attached to the Graph JSON.

Milestone 3 (v0.3+)

Main Goal - Matured Tool and Extensibility

The focus of this milestone is to transition TT-Explorer from a prototype into a mature visualization and editing tool for “Human-In-Loop” compilation. The tool is planned to be made extensible for other dialects and entry points forecast for TT-MLIR (Jax, StableHLO, etc…), and development of the tool's visualization components should feed back into upstream repos like model-explorer. The focus here is on providing extensible interfaces for new UI elements (supporting multi-chip and beyond), the REST API, and overrides.

Tasks:

  • Begin adding new dialects like .ttm, .ttnn to Model Explorer so that compiled results can be inspected and analyzed to optimize at different steps of the compiler.
  • Add Accuracy/Performance Overlays as Node Data into the Model Explorer graph to visualize execution results
  • Enable interaction with ttnn-visualizer and other TT Visualizer tools to provide a more detailed view of execution results.
  • Start introducing InterOp with builtin adapters in model-explorer to support visualizing models from FE.
  • Use split panes to display graph transformations occurring through the compiler, leveraging multiple dialects.
  • To be defined later, depending on the growth of the MLIR Project

ttnn-standalone

ttnn-standalone is a post-compile tuning/debugging tool.

Forge and third party ML models (PyTorch, Jax, ONNX, ...) can be compiled to a set of TTNN library op calls in C++. This generated code can then be used outside of the compiler environment. The ttnn-standalone tool offers all the scaffolding needed to run the C++ code on device (build & run scripts).

Usage

# 1. Convert a model from TTIR dialect to EmitC dialect using ttmlir-opt
# 2. Translate the resulting EmitC dialect to C++ code using ttmlir-translate
# 3. Pipe the generated C++ code to a .cpp file
ttmlir-opt \
  --ttir-to-emitc-pipeline \
  test/ttmlir/EmitC/TTNN/sanity_add.mlir | \
ttmlir-translate \
  --mlir-to-cpp > \
  tools/ttnn-standalone/ttnn-standalone.cpp

# 1. Change dir to `tools/ttnn-standalone`
# 2. Use `run` script to compile and run the compiled binary
cd tools/ttnn-standalone
./run

Note: if you receive this error

-bash: ./run: Permission denied

running chmod +x run will set the execute permission on the script.

PyKernel Guide

PyKernel is a Python interface for developing custom TT-NN operations for Tenstorrent's AI accelerators. This guide explains how to use the PyKernel interface to implement your own TT-NN operations.

Introduction to PyKernel

PyKernel provides a Python-based framework to define hardware-specific kernels that can be used with the TT-NN framework. It allows developers to implement custom operations by defining compute kernels, reader/writer kernels, and control logic in a high-level Python interface.

The PyKernel framework consists of:

  • PyKernelOp: Base class that manages kernel selection, compilation, and execution
  • AST Module: Decorators and utilities for defining kernels
  • Types Module: Type definitions for PyKernel operations

PyKernel Architecture

Foundationally, PyKernel is a compiler built on top of 3 core components, described below.

Python ast Frontend

The frontend of PyKernel parses Python code; this is enabled through the ast (Abstract Syntax Tree) parser built into Python. By walking the AST produced by this module, an MLIR module is created using the ttkernel dialect (along with others such as arith, memref, and scf). This MLIR module is then piped into the next step of the PyKernel compiler. For more information about the kind of kernel code that can be parsed by the frontend, refer to the ttkernel Op spec.
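
As a rough illustration of the machinery involved (this is not the PyKernel frontend itself, just the standard-library pieces it builds on), the sketch below parses a toy kernel's source with ast and walks the resulting tree; a real frontend would emit ttkernel, arith, memref, and scf operations at each node instead of printing.

# Minimal illustration of the `ast` machinery (not the actual PyKernel frontend).
import ast
import inspect

def toy_kernel(cb_in, cb_out):
    for i in range(0, 4, 1):
        pass

tree = ast.parse(inspect.getsource(toy_kernel))
for node in ast.walk(tree):
    # A real frontend would build MLIR ops here; we only print the node kinds.
    print(type(node).__name__)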

Direct To Metal (D2M) Kernel Code Generation

Another component of the tt-mlir project that PyKernel is built on is the D2M compiler infrastructure. This infrastructure is made to dynamically create kernels to performantly execute ML models. By replacing the entry point with the custom MLIR module created by the PyKernel frontend, the same backend can be leveraged. This backend takes the MLIR module and runs it through a series of rewriter passes so that it gets lowered to emitc and eventually translated to C++ code. This C++ code is the artifact consumed by the runtime to execute on Tenstorrent hardware.

TT-NN Generic Op

TT-NN comprises Python-bound precompiled kernels and factories that operate in a manner similar to PyTorch. The Generic Op builds one step on top of this: it takes in and operates on TT-NN tensors and primitives, but has a completely undefined factory and set of kernels, which must be provided to the generic op for it to operate. PyKernel leverages this generality to deploy its dynamically compiled C++ kernels into the Generic Op and interface with TT-NN data as if a "custom" op were implemented. This is the glue that binds the whole compiler together.

Prerequisites

Before using PyKernel, ensure your environment is set up with:

  • TT-MLIR built and installed
  • Python 3.10 or newer
  • Required Python packages
  • TTMLIR_ENABLE_RUNTIME and TTMLIR_ENABLE_PYKERNEL flags set during build

Creating a Custom PyKernel Operation

To create a custom PyKernel operation, you need to:

  1. Create a class that inherits from PyKernelOp
  2. Define kernels using the @compute_thread(), @reader_thread(), or @writer_thread() decorators
  3. Implement the invoke method to create and connect kernels
  4. Define necessary circular buffers
  5. Create a program descriptor that combines kernels and circular buffers

Basic Structure

from pykernel.ast import *
from pykernel.op import PyKernelOp
from pykernel.types import *

import ttnn
import torch

class MyCustomOp(PyKernelOp):
    # Define compute kernel with appropriate decorator
    @compute_thread()
    def my_compute_kernel(cb_in: CircularBuffer, cb_out: CircularBuffer,
                         per_core_block_cnt: CompiledValue,
                         per_core_block_dim: CompiledValue):
        # Initialize the operation
        unary_op_init_common(cb_in, cb_out)

        # Process data in blocks
        for i in range(0, per_core_block_cnt, 1):
            cb_reserve_back(cb_out, per_core_block_dim)
            for j in range(0, per_core_block_dim, 1):
                # Kernel processing code here
                tile_regs_acquire()
                cb_wait_front(cb_in, 1)

                # Your custom processing logic
                # ...

                cb_pop_front(cb_in, 1)
                tile_regs_release()

            cb_push_back(cb_out, per_core_block_dim)
        return

    # Define reader kernel
    @reader_thread()
    def reader_kernel(cb_in: CircularBuffer, cb_out: CircularBuffer,
                     src_addr, num_tiles, start_id,
                     src_is_dram: CompiledValue):
        # Reader kernel code here
        return

    # Define writer kernel
    @writer_thread()
    def writer_kernel(cb_in: CircularBuffer, cb_out: CircularBuffer,
                     dst_addr, num_tiles, start_id,
                     dst_is_dram: CompiledValue):
        # Writer kernel code here
        return

    # The invoke method is the main entry point for kernel execution
    def invoke(self, *tensors, **options):
        # Create circular buffers for input and output tensors
        in_tensor, out_tensor = tensors
        cb_in = self.create_cb(in_tensor, 0)
        cb_out = self.create_cb(out_tensor, 1)

        # Prepare parameters for kernels
        start_id = 0
        is_dram = in_tensor.memory_config().buffer_type == ttnn.BufferType.DRAM
        num_tiles = options["num_tiles"]

        # Create kernels with appropriate parameters
        kernels = [
            self.create_kernel(
                MyCustomOp.my_compute_kernel,
                cb_in, cb_out,
                per_core_block_cnt=num_tiles,
                per_core_block_dim=1
            ),
            self.create_kernel(
                MyCustomOp.writer_kernel,
                cb_in, cb_out,
                out_tensor.buffer_address(),
                num_tiles, start_id,
                dst_is_dram=is_dram
            ),
            self.create_kernel(
                MyCustomOp.reader_kernel,
                cb_in, cb_out,
                in_tensor.buffer_address(),
                num_tiles, start_id,
                src_is_dram=is_dram
            )
        ]

        # Create and return the program descriptor
        return self.create_program(kernels, [cb_in, cb_out])

Kernel Types

PyKernel supports different types of kernels:

  1. Compute Kernels: Process data on the compute units (e.g., SFPU - Scalar Floating-Point Unit)
  2. Reader Kernels: Transfer data from memory to circular buffers
  3. Writer Kernels: Transfer data from circular buffers to memory

Each kernel type has a specific decorator:

  • @compute_thread() - For compute kernels that run on TenSix cores
  • @reader_thread() - For reader kernels that transfer data from memory to circular buffers
  • @writer_thread() - For writer kernels that transfer data from circular buffers to memory

These decorators handle the compilation of Python code into hardware-specific kernels. You can also use the older style decorators if needed:

  • @ttkernel_tensix_compile() - Equivalent to @compute_thread()
  • @ttkernel_noc_compile() - For both reader and writer kernels

Circular Buffers

Circular buffers are used to transfer data between kernels and memory. In the PyKernel framework, there are two aspects of circular buffers:

  1. CircularBuffer class: Used in kernel definitions to represent a circular buffer
  2. CB Descriptors: Used at runtime to configure the actual hardware circular buffers

CircularBuffer Class

The CircularBuffer class is defined in pykernel.types and is used in kernel definitions:

class CircularBuffer:
    def __init__(self, cb_id, tensor_shape=(8, 128, 128), dtype="Float32"):
        self.cb_id = cb_id
        self.tensor_shape = tensor_shape
        self.tile_shape = 32  # default to 32x32 tile shape
        self.tilized_shape = self.get_tilized_memref_shape()
        self.dtype = dtype

Creating Circular Buffers in the Invoke Method

In your custom operation's invoke method, you can create circular buffers using the create_cb helper method from the PyKernelOp base class:

def invoke(self, *tensors, **options):
    in_tensor, out_tensor = tensors
    cb_in = self.create_cb(in_tensor, 0)  # buffer_index=0
    cb_out = self.create_cb(out_tensor, 1)  # buffer_index=1

    # Use cb_in and cb_out in kernel creation
    # ...

    return self.create_program(kernels, [cb_in, cb_out])

The create_cb method handles the creation of the necessary format descriptors and buffer descriptors based on the tensor properties.

Example: EltwiseSFPU Operation

The EltwiseSFPU operation applies an exponential function element-wise to an input tensor. Let's examine a complete implementation based on the demo in test/pykernel/demo/eltwise_sfpu_demo.py:

1. Define the Operation Class

from pykernel.ast import *
from pykernel.op import PyKernelOp
from pykernel.types import *

import ttnn
import torch

class EltwiseSFPUPyKernelOp(PyKernelOp):
    # Kernel implementations will go here

2. Define the Compute Kernel

@compute_thread()
def eltwise_sfpu(
    cb_in: CircularBuffer,
    cb_out: CircularBuffer,
    per_core_block_cnt: CompiledValue,
    per_core_block_dim: CompiledValue,
):
    # Initialize the operation
    unary_op_init_common(cb_in, cb_out)

    # Process tiles
    for i in range(0, per_core_block_cnt, 1):
        cb_reserve_back(cb_out, per_core_block_dim)
        for j in range(0, per_core_block_dim, 1):
            tile_regs_acquire()
            cb_wait_front(cb_in, 1)

            # Copy input tile to register
            copy_tile(cb_in, 0, 0)

            # Apply exponential function
            exp_tile_init()
            exp_tile(0)

            # Commit results
            tile_regs_commit()
            tile_regs_wait()
            pack_tile(0, cb_out, 0)

            cb_pop_front(cb_in, 1)
            tile_regs_release()

        cb_push_back(cb_out, per_core_block_dim)
    return

3. Define Writer Kernel

@writer_thread()
def writer_unary_interleaved(
    cb_in: CircularBuffer,
    cb_out: CircularBuffer,
    dst_addr,
    num_tiles,
    start_id,
    dst_is_dram: CompiledValue,
):
    onetile = 1
    tile_bytes = get_tile_size(cb_out)
    dataformat = get_dataformat(cb_out)

    s0 = get_interleaved_addr_gen_fast(
        dst_is_dram, dst_addr, tile_bytes, dataformat
    )

    end_id = start_id + num_tiles
    ii: int = start_id
    for i in range(start_id, end_id, onetile):
        cb_wait_front(cb_out, onetile)
        l1_read_addr = get_read_ptr(cb_out)
        noc_async_write_tile(ii, s0, l1_read_addr)
        noc_async_write_barrier()
        cb_pop_front(cb_out, onetile)
        ii += onetile
    return

4. Define Reader Kernel

@reader_thread()
def reader_unary_interleaved(
    cb_in: CircularBuffer,
    cb_out: CircularBuffer,
    src_addr,
    num_tiles,
    start_id,
    src_is_dram: CompiledValue,
):
    onetile = 1
    tile_bytes = get_tile_size(cb_in)
    dataformat = get_dataformat(cb_in)

    s0 = get_interleaved_addr_gen_fast(
        src_is_dram, src_addr, tile_bytes, dataformat
    )

    end_id = start_id + num_tiles
    ii: int = start_id
    for i in range(start_id, end_id, onetile):
        cb_reserve_back(cb_in, onetile)
        l1_write_addr = get_write_ptr(cb_in)
        noc_async_read_tile(ii, s0, l1_write_addr)
        noc_async_read_barrier()
        cb_push_back(cb_in, onetile)
        ii += onetile
    return

5. Implement the Invoke Method

The invoke method is the critical part that connects the kernels together and creates the program descriptor:

def invoke(self, *tensors, **options):
    # Extract input and output tensors
    in_tensor, out_tensor = tensors

    # Create circular buffers
    cb_in = self.create_cb(in_tensor, 0)
    cb_out = self.create_cb(out_tensor, 1)

    # Set up parameters
    start_id = 0
    is_dram_input = in_tensor.memory_config().buffer_type == ttnn.BufferType.DRAM
    num_tiles = options["num_tiles"]

    # Create kernels with appropriate parameters
    kernels = [
        self.create_kernel(
            EltwiseSFPUPyKernelOp.eltwise_sfpu,
            cb_in,
            cb_out,
            per_core_block_cnt=num_tiles,
            per_core_block_dim=1,
        ),
        self.create_kernel(
            EltwiseSFPUPyKernelOp.writer_unary_interleaved,
            cb_in,
            cb_out,
            out_tensor.buffer_address(),
            num_tiles,
            start_id,
            dst_is_dram=is_dram_input,
        ),
        self.create_kernel(
            EltwiseSFPUPyKernelOp.reader_unary_interleaved,
            cb_in,
            cb_out,
            in_tensor.buffer_address(),
            num_tiles,
            start_id,
            src_is_dram=is_dram_input,
        ),
    ]

    # Create and return the program descriptor
    return self.create_program(kernels, [cb_in, cb_out])

Running the EltwiseSFPU Demo

The EltwiseSFPU demo applies an exponential function element-wise to a tensor. It can be run using the pykernel-demo target:

source env/activate
# Ensure the TTMLIR_ENABLE_RUNTIME and TTMLIR_ENABLE_PYKERNEL flags are set during build
cmake --build build -- pykernel-demo

Demo Breakdown

Let's examine how to use the PyKernel operation in practice:

# Open a device
device = ttnn.open_device(device_id=0)

# Define tensor shapes and data
num_tiles = 4
shape = [1, num_tiles, 32, 32]
data = torch.rand(shape).to(torch.bfloat16)

# Configure memory
dram_memory_config = ttnn.DRAM_MEMORY_CONFIG

# Create input tensor
input_tensor = ttnn.from_torch(
    data,
    dtype=ttnn.bfloat16,
    layout=ttnn.TILE_LAYOUT,
    device=device,
    memory_config=dram_memory_config,
)

# Create output tensor
output_tensor = ttnn.allocate_tensor_on_device(
    ttnn.Shape(shape),
    ttnn.bfloat16,
    ttnn.TILE_LAYOUT,
    device,
    dram_memory_config,
)

# Prepare tensors for the operation
io_tensors = [input_tensor, output_tensor]

# Create the custom operation
eltwise_exp_op = EltwiseSFPUPyKernelOp()

# Execute the operation with the tensors and options
output = eltwise_exp_op(*io_tensors, num_tiles=num_tiles)

# Compare with the built-in exponential operation
golden = ttnn.exp(input_tensor)

# Convert to torch tensors for comparison
torch_golden = ttnn.to_torch(golden)
torch_output = ttnn.to_torch(output)

# Verify results
matching = torch.allclose(torch_golden, torch_output)
print(f"Tensors are matching: {matching}")
assert matching

This demo shows the complete workflow:

  1. Opens a device
  2. Creates input and output tensors with appropriate memory configuration
  3. Instantiates the EltwiseSFPUPyKernelOp class
  4. Executes the operation by calling the op with tensors and options
  5. Compares the result with the built-in TT-NN implementation

Comparison with Native TT-NN Operations

PyKernel operations integrate seamlessly with native TT-NN operations. As shown in the demo, you can compare your custom PyKernel operation with built-in TT-NN operations:

# Execute your custom PyKernel operation
output = eltwise_exp_op(*io_tensors, num_tiles=num_tiles)

# Execute the equivalent built-in TT-NN operation
golden = ttnn.exp(input_tensor)

# Convert both to torch tensors for comparison
torch_golden = ttnn.to_torch(golden)
torch_output = ttnn.to_torch(output)

# Verify the results match
matching = torch.allclose(torch_golden, torch_output)
print(f"Tensors are matching: {matching}")
assert matching

This approach allows you to:

  1. Validate your custom operation against known implementations
  2. Benchmark performance differences between custom and built-in operations
  3. Extend the TT-NN framework with operations not available in the standard library

Building and Testing

To build and test PyKernel, you need to enable both the runtime and PyKernel components:

source env/activate

# Configure with PyKernel enabled
cmake -G Ninja -B build \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=clang-17 \
    -DCMAKE_CXX_COMPILER=clang++-17 \
    -DTTMLIR_ENABLE_RUNTIME=ON \
    -DTTMLIR_ENABLE_PYKERNEL=ON

# Build the project
cmake --build build

# Run the PyKernel demo
cmake --build build -- pykernel-demo

The TTMLIR_ENABLE_RUNTIME and TTMLIR_ENABLE_PYKERNEL flags are essential for PyKernel functionality. Without these flags, the PyKernel components will not be built.

Best Practices

When developing with PyKernel, follow these best practices:

  1. Separate concerns: Keep compute, reader, and writer kernels separate for better maintainability and reusability

  2. Use appropriate decorators: Apply the correct decorator for each kernel type:

    • @compute_thread() for compute kernels
    • @reader_thread() for reader kernels
    • @writer_thread() for writer kernels
  3. Implement the invoke method properly: The invoke method is critical as it connects all components:

    • Create circular buffers with appropriate parameters
    • Set up kernel parameters correctly
    • Create kernels with the right arguments
    • Return a program descriptor that includes all kernels and circular buffers
  4. Handle memory configurations: Be aware of memory types (DRAM vs L1) when creating kernels

  5. Reuse kernels: Create reusable kernels for common operations to avoid code duplication

  6. Leverage caching: PyKernelOp automatically caches compiled kernels for performance

  7. Test thoroughly: Always compare results with reference implementations or built-in TT-NN operations

  8. Document parameters: Clearly document the expected parameters for your PyKernel operation

Summary

PyKernel provides a flexible and powerful way to implement custom operations for Tenstorrent hardware. By following the pattern outlined in this guide, you can create your own operations that integrate seamlessly with the TT-NN framework.

Key components of the PyKernel framework:

  1. PyKernelOp base class: Handles kernel management, compilation, and caching
  2. Kernel decorators: @compute_thread(), @reader_thread(), and @writer_thread()
  3. CircularBuffer class: Represents circular buffers in kernel definitions
  4. invoke method: The critical implementation that connects kernels and creates the program

The workflow for creating a custom PyKernel operation is:

  1. Create a class that inherits from PyKernelOp
  2. Define compute, reader, and writer kernels with appropriate decorators
  3. Implement the invoke method to create circular buffers and connect kernels
  4. Use the operation by instantiating your class and calling it with tensors and options

With PyKernel, you can extend the TT-NN framework with custom operations that leverage the full power of Tenstorrent hardware while maintaining a clean, high-level Python interface.

Creating Bug Repros for TTNN Using TT-MLIR Codegen

While developing in tt-mlir, it's not uncommon to encounter bugs originating in the TTNN library. To isolate and report such bugs, a practical approach is to use the C++ codegen feature (EmitC) to generate a minimal repro. This guide walks you through how to create such repros and integrate them into the tt-metal repository, where TTNN is developed.


Step-by-Step Guide

Note: If you run into issues while following these steps, check the Known Issues section at the end of this guide for common problems and solutions.

1. Generate C++ Code from TT-MLIR

Use the ttnn-standalone tool to run the compiler and emit C++ code.

📖 See ttnn-standalone.md for instructions on how to generate C++ code from your MLIR input using EmitC.

2. Scope Down the Repro

Once you've generated the C++ code:

  • Use the ttnn-standalone tool to run and debug it in isolation.
  • Reduce the repro to the minimal example that still triggers the bug.
  • Confirm the issue still reproduces reliably.

3. Clone the TT-Metal Repository

Clone the tt-metal repo:

git clone git@github.com:tenstorrent/tt-metal.git
cd tt-metal

4. Add the Repro to the GTest Infrastructure

Place your .cpp file in:

tests/ttnn/unit_tests/gtests/emitc/

and add it to the cmake file:

tests/ttnn/unit_tests/gtests/CMakeLists.txt

like so:

set(EMITC_UNIT_TESTS_SRC
    ${CMAKE_CURRENT_SOURCE_DIR}/emitc/test_sanity.cpp
    ${CMAKE_CURRENT_SOURCE_DIR}/emitc/your_test_name.cpp  # <<<===
)

Use the existing file test_sanity.cpp in that directory as a reference.

5. Modify the Repro for GTest

There are some modifications that need to be made in order to fit the GTest infra:

  • Convert the main() function to a TEST(...) macro:
TEST(EmitC, YourTestName) {
    // Your original main function body here
}
  • Remove any return statements from the TEST(...) function body.
  • Replace #include "ttnn-precompiled.hpp" with #include "emitc.hpp"

6. Build the TTNN EmitC Tests

First, activate the python virtual env, and set some env variables:

source python_env/bin/activate
export TT_METAL_HOME=$(pwd)
export PYTHONPATH=$(pwd)

Then, build the tests:

./build_metal.sh --build-ttnn-tests

Note: some unrelated gtests might fail here; these can be ignored.

7. Run the EmitC Unit Tests

To run all EmitC tests:

./build/test/ttnn/unit_tests_ttnn_emitc

To run a specific test:

./build/test/ttnn/unit_tests_ttnn_emitc --gtest_filter=EmitC.YourTestName

8. Share the Repro

  • Create a branch with your changes.
  • Open a GitHub issue or comment on an existing one.
  • Link to your branch and include the instructions for running the repro:
./build_metal.sh --build-ttnn-tests
./build/test/ttnn/unit_tests_ttnn_emitc
./build/test/ttnn/unit_tests_ttnn_emitc --gtest_filter=EmitC.YourTestName

Known Issues

  • Missing sfpi compiler or other dependencies: If you encounter errors about a missing sfpi compiler or other system-level dependencies, refer to the tt-metal installation guide for instructions on installing the required packages.

  • TTNN test compilation failures: If the build fails when compiling TTNN tests, inspect the specific tests that caused the failure. If the failures are unrelated to EmitC tests, they can typically be ignored — this is a known issue.

Python Bindings

This page aims to clarify, document, and demystify the tt-mlir python bindings. It will do so by first highlighting the mechanism with which these bindings are generated and exposed to users. It will then document the nuances of nanobind, and the different parts of these bindings that must be written by hand. Finally, it will go through a hands-on example of how to add your own functionality to the tt-mlir python bindings.

nanobind

Nanobind is the successor of the ubiquitous pybind project. In almost the same syntactical form, it provides a framework to define InterOp between C++ and Python. For more information about nanobind specifically, I'd recommend reading through the documentation. MLIR (and by extension: tt-mlir) leverages nanobind to create bindings for the C++ framework of Dialects, Ops, Types, Attributes, and Passes to be used in Python.

MLIR in Python

This section highlights the machinery and configuration with which MLIR can be exposed to Python, while still maintaining functional interop with the C++ code. For more context and information feel free to read the MLIR Python Documentation.

C-API

While the documentation provides a rather lacklustre explanation as to why the C-API exists, I am here to provide my take on the existence and purpose of the MLIR CAPI.

RTTI

MLIR, being part of the llvm-project, follows LLVM's "custom" RTTI. For this reason, the entire C++ portion of the project is built without RTTI to enable that custom functionality. nanobind, however, requires RTTI to perform a lot of the casting and transformation required to interop with Python. This conflict leads to the natural desire for an alternative.

C doesn't have RTTI; it's a stable language without the extra convenience and machinery present in C++. With a C-API in place, the Python bindings can link against it, relying on externally defined NanobindAdaptors to do the type conversions using nanobind mechanisms instead of relying on C++/LLVM RTTI for the Python bindings.

C++ ABI

The C++ Application Binary Interface (ABI) proves to be a challenging barrier to accessing functionality from C++. Without a defined stable ABI, it becomes difficult to deal with some of the complexity required to package and interop with Python. Specifically, dealing with templates, inheritance, and RTTI can prove quite the challenge.

To simplify this process, C provides a relatively stable ABI. The C-API also acts as a wrapper around the complex C++ functions, providing a simple "trampoline" for Python to link against.

nanobind x C-API Functionality

In the previous section, I mentioned NanobindAdaptors. This file helps to define some of the key design decisions made when linking the Python bindings against the C-API instead of the underlying C++ API. Functionally, the Python bindings act as a "wrapper" around the CAPI, exposing the functionality through python.

include/mlir-c/Bindings/Python/Interop.h

This file is key to defining the interop between the C-API and Python w.r.t. maintaining and accessing information through a pointer. It exposes an API that interfaces raw data pointers with Python capsules. PyCapsules are essentially thin wrappers around data pointers in Python; critically, they contain the data (void*), a destructor method, and a name.

Within the interop, the assumption is that the data's ownership and lifetime are managed by some bound object that was created in C++. This file merely provides the API with which the underlying data pointer is passed around as either a PyCapsule or a raw pointer, along with the type conversion utilities to convert an underlying object between Python and C.

include/mlir/CAPI/Wrap.h

This header defines the API for interop between C-API objects and their C++ equivalents. Calling wrap() on a C++ MLIR object creates a C-API object over the same underlying memory, and unwrap() does it the other way around.

The key caveat with this wrapping/unwrapping is ownership over the lifetime of the data itself. The constructors for almost all of the primitives have already been defined in C++. As such, creating a new C-API object really amounts to creating an object in C++ and wrapping it into a CAPI object. The lifetime of the pointer is therefore maintained by the CAPI object as it gets passed around in return objects.

include/mlir/Bindings/Python/NanobindAdaptors.h

As the CAPI object gets bounced around in memory, the ownership and lifetime of the data must eventually reach Python to be controlled by the user. How the data reaches Python is not relevant to this component. This component provides the utilities to create copies of the underlying data and send them through nanobind, effectively acting as the interop layer between CAPI objects and their nanobind equivalents.

Through the carefully created contract between these components of the MLIR project, the IR primitives are exposed to Python, created in C++, and bounced off of the C-API. While I may have glossed over the other supporting mechanisms in this explanation, explore the parent directories of these three files for a more detailed look into the semantics of ownership and such.

Defining the C-API

For primitives to be defined for use in Python, they must first be implemented in C++. This is outside the scope of the Python-specific code; please refer to the rest of the tt-mlir documentation for references on this. Once the C++ functionality is defined, the C-API must be constructed on top of it to serve as the "interop" layer.

get & Constructing C-API Objects

Since most constructors for IR primitives are created in C++, the goal is to construct objects in C++, but have the ownership exposed to Python. We do this through the creation of a Get function. The get function essentially takes in primitive C types and invokes the ::get builder in C++ to construct the object. A simple code example for the tt dialect's TileType is shown below:

include/ttmlir-c/TTTypes.h


// We export the function outside of the scope of "C" such that it can be defined later using C++ methods.

MLIR_CAPI_EXPORTED MlirType ttmlirTTTileTypeGet(MlirContext ctx, unsigned height, unsigned width, uint32_t dataType);

lib/CAPI/TTTypes.cpp


MlirType ttmlirTTTileTypeGet(MlirContext ctx, unsigned height, unsigned width, uint32_t dataType) {
    // We return the **wrapped** created C++ object, transferring Ownership to the C-API
    return wrap(
        TileType::get(
            unwrap(ctx), // Now we unwrap the MlirContext object to cast it to a mlir::MLIRContext object (w/o affecting ownership)
            llvm::SmallVector<std::int64_t>{height, width}, // We construct the list here since a list isn't natively defined in the C-API,
            static_cast<tt::DataType>(dataType) // Here we cast the int value to get the Enum value from `tt::DataType`
        ) // Invoking the builtin get operator to create and get the pointer for some object
    );
}

The key detail to note is the reliance on C++ methods, like initializer lists, in the get definition. By leveraging the interop, the get method returns a pointer which can easily be represented in the C-API and owned as such, while masking the complexities of the underlying C++ from nanobind. Definitions such as these must either be written by hand (as shown above), or they can be generated automatically for certain IR primitives. We will learn more about that below.

Generating Bindings

This section will outline the mechanism with which bindings are generated, and the intricacies of this step.

Declaring Python Bindings

The first step to kicking off binding generation is to declare that they should exist for some dialect. MLIR provides a CMake module (AddMLIRPython) which exposes the following utility functions which can be declared to state what Python bindings are generated. For more information about the specific arguments and expected structure of these CMake functions refer to the AddMLIRPython module and python/CMakeLists.txt.

declare_mlir_python_sources

Overview
This function provides an interface to directly copy .py source files into the final built python module.

Key Arguments

  • ADD_TO_PARENT defines the parent name to which this source will be added, inheriting its location.

Usecases

  • We use it to declare generic "Parents" which contain the generated/declared python files from many of the submodules within the dialects.
  • We use it to directly copy over key test infrastructure like ttir_builder as purely python programmed modules.

declare_mlir_dialect_python_bindings

Overview
This function is the key to invoking the mechanism to generate python bindings from Tablegen Definitions.

Key Arguments

  • TD_FILE Relative to ROOT_DIR, where the Tablegen Definition file to build bindings off of is located. Note: This currently just forwards the TD files from include/ttmlir/Dialect.
  • SOURCES Raw python files associated with bindings. Note: These files essentially re-export the generated modules.
  • GEN_ENUM_BINDINGS_TD_FILE if GEN_ENUM_BINDINGS is ON, this will build enum bindings from the defined Tablegen file.
  • DIALECT_NAME What name the dialects should be generated under.

Usecases

  • We use this CMake function to define and generate the bindings for the ttkernel, ttir, tt, and ttnn dialects.

declare_mlir_python_extension

Overview
This is the CMake function used to link C++ Source Files + declared nanobinds into the generated python module.

Key Arguments

  • EMBED_CAPI_LINKS_LIBS This is to declare the libraries used to link against the CAPI in the bindings. Learn more in the CAPI section below.
  • PRIVATE_LINK_LIBS Declares other libraries that are linked against the Python bindings.

Usecases

  • We use this function to build and link all of our custom nanobinds and hand-written Type/Attr bindings into the ttmlir module.

add_mlir_python_common_capi_library

Overview
This function adds a shared library embedding all of the core CAPI libs needed to link against extensions.

add_mlir_python_modules

Overview
This is the final packaging function of the python bindings, linking all of the sources together and packaging it into a built module.

Building MLIR Primitives from Tablegen

The declare_mlir_dialect_python_bindings function leverages mlir-tblgen to build the Python bindings for a defined dialect. What are the intricacies of this functionality?

mlir-tblgen

This tool parses .td Tablegen files to automatically generate C++ code to implement that functionality in MLIR. We leverage the Tablegen language to define our dialects in tt-mlir, and this tool is exactly what gets invoked to build and generate the code to functionally use this dialect in our codebase.

Trivial Constructors

To deal with automatically generating the functionality around an Operation, a certain amount of generality is needed to deem the problem trivial enough to generate. All of the IR primitives are thankfully able to be constructed from .td to their relevant C++ implementations. However, as shown in the TileType example, the conversion from simple C primitives (+ pre-defined MLIR C-API types) to C++ get functions isn't trivial. For this reason, we can start to analyze the IR primitives and deem which ones are trivial for C-API generation, and which must be implemented by hand.

  • enum
    • The enum type can be considered very generic, with the underlying storage being integral values and an optional string representation in MLIR. By iterating over all of the user-defined enum values, a very trivial constructor can be generated automatically.
  • operation
    • Operations are a unique case where the constructor isn't often generic enough; however, the OperationState exists as a strictly defined struct which contains all of the relevant construction details and implementation requirements for an operation. For this reason, while it is not trivial, it is generic enough that the OperationState can be relied on to form a mechanism which automatically generates C-API builders.
  • Types/Attributes
    • Types and Attributes unfortunately receive the short end of the stick. Their constructors are wildly generic, and there is no baseline for what is required in the construction of a Type/Attr. For this reason, at the current moment these primitives aren't supported for automatic generation in mlir-tblgen, and must be defined by hand.

Writing Bindings

With the understanding that not all bindings can be automatically generated for us, we can head into the intricacies of defining your own bindings for Types/Attrs.

LLVM-Style Pythonic "Type Information" + Casting

An important caveat to introduce before entering the domain of writing our own bindings is the understanding of how MLIR approaches the problem of downcasting w.r.t. IR primitives. Considering the C-API doesn't have an inheritance structure, Python is required to uphold the inheritance structure and hold the type information such that casting is possible between primitives and their specific implementation (ex: going from MlirAttribute -> TTNNLayoutAttr).

This mechanism can be exposed to Python in multiple different ways; MLIR supports a specific implementation of an mlir_attribute_class and mlir_type_class which take in 2 additional C-API functions. To initialize a class using this structure, the following functions are required:

  • myAttributeGet: to construct the Type/Attr
  • myAttributeGetTypeID: provides a unique static TypeID for myAttribute
  • isAMyAttribute: a boolean check to see whether a generic higher-level type is an instance of this type.

This will then provide an interface where in python a type can be cast by calling the constructor method of some downcasted type:

# Example to downcast using MLIR provided methods.
my_attribute = mlir.MyAttribute(attr)  # where attr is a generic _mlir.ir.MlirAttribute

Choosing a direct C++ structure instead of C-API

Those who are familiar with the tt-mlir python bindings may be aware that our code structure looks drastically different from this. Why is that? The answer lies in the redundancy and lack of extensive use of the nanobind mechanisms around the tt-mlir Python bindings.

As mentioned in the C-API section, the C-API is required to form the contract between C++ -> Python, to reduce the collisions with RTTI and the unstable ABI from C++. That being said, it's not unsupported to still directly access C++ members from nanobind and skip the C-API Builder functions, instead just opting to create in C++ directly and then wrap that result. This is the approach taken "consciously" in the tt-mlir python bindings.

What are the consequences of this design decision? The advantages?

Direct MLIR Casting Support

Instead of relying on Python for casting and defining C-API functions to support this functionality, this approach allows us to directly use mlir::isa, mlir::cast, etc. in its place.

For example, we support tt_attribute_class and tt_type_class, which leverage isa and dyn_cast to downcast to Types and Attrs by wrapping the Python types and operating on the underlying C++ types.

This also brings about some potential collisions with RTTI from nanobind. None are present in the bindings (as far as I know), but the bindings are exposed to this problem moving forward.

Simpler Initialization Structures

Instead of having to invoke a C-API function to define the get method in nanobind, we can directly invoke the wrap(CppType::get(...)) functionality that the C-API ends up calling. The primary difference is the native support for complex data structures like vector and map through nanobind. Take for example the following initialization for an attribute:

// C-API Definition for myAttributeGet
MlirAttribute myAttributeGet(MlirContext ctx, int* array, size_t arraySize) {
    return wrap(MyAttribute::get(unwrap(ctx), std::vector<int>{array, array + arraySize}));
}

// nanobind direct invocation
tt_attribute_class(m, "MyAttribute")
    .def_static("get", [](MlirContext ctx, std::vector<int> arr) {
        return wrap(MyAttribute::get(unwrap(ctx), arr));
    });

// nanobind invocation through C-API
mlir_attribute_class(m, "MyAttribute", myAttributeGetTypeID, isAMyAttribute)
    .def_static("get", [](MlirContext ctx, std::vector<int> arr) {
        return myAttributeGet(ctx, arr.data(), arr.size());
    });
// Note: While this may seem like a trivial change, the cost for retaining the function signature in C begins to grow very quickly. Especially when considering maps and more complex data structures.

Again, this does come with some nuances w.r.t. the ABI, but for our simple usecase of the bindings it can be considered acceptable...

Wait... why are we still defining the CAPI Builders Then?

This leads to an underlying question: What's the point of still defining the CAPI functions if we actually never end up using them? The answer is that we would ideally still maintain the infrastructure to backtrack our changes if we end up making more extensive use of the Python bindings and come across nasty ABI/RTTI issues, or MLIR upstreams significant changes to the Python bindings where we would have to leverage their architecture. With regards to the latter, I have asked some of the contributors and received "iffy" responses, with the general answer being that major changes are not planned for the MLIR Python bindings infrastructure.

That being said, for the low low cost of a few redundant functions being defined, we have a clear backup route in case the Python bindings blow up in our faces. I do think this argument is built on significant personal opinion; in the future we may change the strategy for the bindings. For now, it makes the structure of our python code cleaner, while having a clear route forward if something breaks.

Each MLIR project I've used as a reference approaches the problems differently. AFAIK the bindings are generally defined however the end user desires to invoke them :)

General Structure

Considering that mlir-tblgen will handle the generation of the underlying C++ code, we only need to define the C Builders and the nanobinds for each of the Types/Attrs we would like to add.

This often comprises the following contributions:

  • Declaring the C-API Header Function(s) in include/ttmlir-c
  • Defining the C-API Function(s) in lib/CAPI
  • Writing out the nanobind for that Type/Attr in python/.

Example: Defining ttkernel Python Bindings

In this section, we will go through a worked example on the different steps required to expose functionality for the TTKernel dialect.

  1. We will continue while assuming that the TTKernel dialect has been defined using Tablegen and already has a valid target that compiles the C++ functionality. We will also assume that the current CMake build targets and functionality that uphold the rest of the ttmlir dialects already exist.
  2. Declare and register the TTKernel dialect in the C-API by calling the MLIR_DECLARE_CAPI_DIALECT_REGISTRATION(TTKernel, ttkernel); macro in include/ttmlir-c/Dialects.h:
// File: include/ttmlir-c/Dialects.h

#include "mlir-c/IR.h"

#ifdef __cplusplus
extern "C" {
#endif

MLIR_DECLARE_CAPI_DIALECT_REGISTRATION(TTKernel, ttkernel);

#ifdef __cplusplus
}
#endif
  3. Declare CAPI Builder for all of the Types (namely only CBType needs to be implemented) in include/ttmlir-c/TTKernelTypes.h
// File: include/ttmlir-c/TTKernelTypes.h

#include "ttmlir-c/Dialects.h"

#ifdef __cplusplus
extern "C" {
#endif

MLIR_CAPI_EXPORTED MlirType ttmlirTTKernelCBTypeGet(
    MlirContext ctx, uint64_t port, uint64_t address,
    MlirType memrefType);

#ifdef __cplusplus
}
#endif
  4. Declare the CAPI builder target in lib/CAPI/CMakeLists.txt by adding TTKernelTypes.cpp as a source to TTMLIRCAPI.
  5. Define the Dialect by formally applying the generated Dialect type into the CAPI_DIALECT_REGISTRATION macro.
// File: lib/CAPI/Dialects.cpp

#include "ttmlir-c/Dialects.h"

#include "mlir/CAPI/Registration.h"
#include "ttmlir/Dialect/TTKernel/IR/TTKernel.h"

MLIR_DEFINE_CAPI_DIALECT_REGISTRATION(
    TTKernel, ttkernel, mlir::tt::ttkernel::TTKernelDialect)
  6. Define the CAPI get method for CBType
// File: lib/CAPI/TTKernelTypes.cpp

#include "ttmlir-c/TTKernelTypes.h"
#include "mlir/CAPI/IR.h"
#include "mlir/CAPI/Support.h"

#include "ttmlir/Dialect/TTKernel/IR/TTKernelOpsTypes.h"

using namespace mlir::tt::ttkernel;

MlirType ttmlirTTKernelCBTypeGet(MlirContext ctx, uint64_t port, uint64_t address, MlirType memrefType) {
  return wrap(CBType::get(unwrap(ctx), symbolizeCBPort(port).value(), address, mlir::cast<mlir::MemRefType>(unwrap(memrefType))));
}
  7. Define the nanobind build target in python/CMakeLists.txt by adding ttkernel as a dialect, and providing TTKernelModule.cpp as a source for TTMLIRPythonExtensions.Main.
# Define ttkernel dialect
declare_mlir_dialect_python_bindings(
  ADD_TO_PARENT TTMLIRPythonSources.Dialects
  ROOT_DIR "${TTMLIR_PYTHON_ROOT_DIR}"
  TD_FILE dialects/TTKernelBinding.td
  SOURCES dialects/ttkernel.py
  DIALECT_NAME ttkernel
)
  8. Create python/dialects/TTKernelBinding.td to forward the tablegen for TTKernel to the CMake dialect target:
include "ttmlir/Dialect/TTKernel/IR/TTKernelOps.td"
  9. Create nanobind module for TTKernel Dialect in python/TTMLIRModule.cpp
// Representation of the Delta you have to add to TTMLIRModule.cpp in the correct locations
NB_MODULE(_ttmlir, m) {
  m.doc() = "ttmlir main python extension";

  m.def(
      "register_dialect",
      [](MlirContext context, bool load) {
        MlirDialectHandle ttkernel_handle = mlirGetDialectHandle__ttkernel__();
        mlirDialectHandleRegisterDialect(ttkernel_handle, context);
        if (load) {
          mlirDialectHandleLoadDialect(ttkernel_handle, context);
        }
      },
      py::arg("context"), py::arg("load") = true);

  auto ttkernel_ir = m.def_submodule("ttkernel_ir", "TTKernel IR Bindings");
  mlir::ttmlir::python::populateTTKernelModule(ttkernel_ir);
}
  10. Define populateTTKernelModule in python/TTKernelModule.cpp
// File: python/TTKernelModule.cpp
#include <vector>

#include "ttmlir/Bindings/Python/TTMLIRModule.h"

#include "mlir/CAPI/IR.h"
#include "ttmlir-c/TTKernelTypes.h"

#include "ttmlir/Dialect/TTKernel/IR/TTKernelOpsTypes.h"

namespace mlir::ttmlir::python {
void populateTTKernelModule(py::module &m) {
  tt_type_class<tt::ttkernel::CBType>(m, "CBType")
      .def_static("get",
                  [](MlirContext ctx, uint64_t port, uint64_t address,
                     MlirType memrefType) {
                    return ttmlirTTKernelCBTypeGet(ctx, port, address,
                                                   memrefType);
                    // Note that for more complex constructors / out of ease this could also be defined using the wrap(CBType::get) style constructor.
                  })
      .def_prop_ro("shape", [](tt::ttkernel::CBType &cb) {
            cb.getShape().vec();
        })
      .def_prop_ro("memref", &tt::ttkernel::CBType::getMemref);
}
} // namespace mlir::ttmlir::python
  1. Finally, expose the built python bindings using a "trampoline" python file in python/dialects/ttkernel.py
from ._ttkernel_ops_gen import *
from .._mlir_libs._ttmlir import register_dialect, ttkernel_ir as ir

# Import nanobind defined targets into ttkernel.ir, and the rest of the generated Ops into the top-level ttkernel python module.

Concluding The Example

While there are quite a few steps for adding a whole new dialect, more often than not you will only need a subset of these steps to add a new Type/Attr to an existing dialect, and even fewer to modify the signature of an existing Type/Attr in the bindings.
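
As a quick sanity check, the freshly built bindings can be exercised from Python roughly as follows. This is an illustrative sketch only: the exact module layout, the CBPort value, and the memref shape are assumptions based on the example above, not verified API.

# Illustrative usage of the ttkernel bindings built in the steps above.
from ttmlir.ir import Context, MemRefType, F32Type
from ttmlir.dialects import ttkernel

with Context() as ctx:
    # Register and load the TTKernel dialect into the context.
    ttkernel.register_dialect(ctx)
    # A memref type to back the circular buffer (shape chosen arbitrarily).
    memref = MemRefType.get([8, 128], F32Type.get(ctx))
    # The port and address values are placeholders for illustration.
    cb = ttkernel.ir.CBType.get(ctx, 0, 0, memref)
    print(cb)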

Using the Python Bindings

This section will cover the basics of using the Python bindings. The MLIR folks have produced documentation that can help you get up to speed quickly; this section focuses on some of the nuances of the python bindings that ttmlir has defined explicitly.

Interfacing with Generated Op Classes

The unfortunate reality is that there is no documentation for the autogenerated Ops. Fortunately, argument names are preserved, so the generated signatures can be inspected with Python's built-in help function. Iteratively stepping through the functions you want to call this way can be very helpful.
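
For example, from an interactive session you might inspect a generated op wrapper as shown below. The dialect and op class used here are purely illustrative; substitute whichever generated class you are actually working with.

from ttmlir.dialects import ttir

# Print the autogenerated constructor signature and argument names.
help(ttir.MatmulOp)
# The same trick works for the free-function builders (e.g. help(ttir.matmul))
# and for any other generated dialect module.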

MLIRModuleLogger

Almost all of the ttmlir python bindings behave exactly as you would expect. One addition worth calling out, because it provides some useful context on nanobind and managed memory, is the MLIRModuleLogger.

This class is defined in C++ to attach to an existing MLIRContext, adding hooks that save snapshots of the module into a std::vector<std::pair<std::string, std::string>>. Exposing it through nanobind requires some care about the state of the MLIRModuleLogger object: it modifies memory managed by C++, but it attaches to a context that lives in Python. This state management is handled by having nanobind own and manage a thinly wrapped pointer to the C++ object via the return value policy.

In general, using the Python bindings to traverse memory that lives outside of the IR primitives requires some care to ensure the data stays alive and the code behaves as intended.
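
To make the ownership story more concrete, here is a minimal nanobind sketch of the same idea. The class, method names, and module layout are invented for illustration and are not the actual ttmlir sources; the point is the return value policy that lets Python own the thinly wrapped C++ pointer.

// Illustrative nanobind sketch (not the real ttmlir code): a C++ logger object that
// attaches to a context owned by Python, where nanobind takes ownership of the
// returned pointer so the object outlives the call that created it.
#include <string>
#include <utility>
#include <vector>

#include <nanobind/nanobind.h>
#include <nanobind/stl/pair.h>
#include <nanobind/stl/string.h>
#include <nanobind/stl/vector.h>

namespace nb = nanobind;

struct ExampleModuleLogger {
  // Snapshots of (pass name, module dump) captured as the pipeline runs.
  std::vector<std::pair<std::string, std::string>> moduleCache;

  void attachContext(void *mlirContext) {
    // Real code would register IR-printing hooks on the MLIRContext here.
    (void)mlirContext;
  }
};

NB_MODULE(example_logger, m) {
  nb::class_<ExampleModuleLogger>(m, "ExampleModuleLogger")
      .def_static(
          "attach",
          [](nb::capsule ctx) {
            auto *logger = new ExampleModuleLogger();
            logger->attachContext(ctx.data());
            return logger;
          },
          // Python owns the thinly wrapped C++ pointer and frees it when the
          // Python object is garbage collected.
          nb::rv_policy::take_ownership)
      .def_ro("module_log", &ExampleModuleLogger::moduleCache);
}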

Flatbuffers

Flatbuffers are the binary serialization format used by TTMLIR and they currently come in a few flavors (designated by the file extension):

  • .ttsys: A system description file that is the mechanism for supplying target information to the compiler. These can be collected on a target machine and downloaded to a development machine to enable cross-compilation.
  • .ttnn: A compiled binary file intended to be loaded and executed by the TTNN backend runtime.
  • .ttb: A compiled binary file intended to be loaded and executed by the TTMetal backend runtime (Unsupported).

CI

Our CI infrastructure is currently hosted in the cloud; cloud machines are provisioned and linked as GitHub runners.

CI is triggered on every new pull request and on every push to main (usually when a PR is merged).

CI is designed to automatically collect analytics data for each workflow run, including test results and code coverage. It also publishes the latest version of the documentation to GitHub.

Builds

CI performs the following build jobs:

  • Release build "speedy" - release image optimized for speed.
  • Release build "tracy" - release image with runtime trace/debug capabilities including performance measurments.
  • Debug build with unit tests and test coverage.
  • CLang tidy
  • ...and Tests

The build of tt-mlir is done using build-tt-mlir-action. Only the Debug build has a specific implementation because it is also used to run unit tests and to collect and publish code coverage data. Code coverage results are published to codecov, and a link to the detailed coverage information is attached as a comment on the PR. Test results are published as workflow artifacts in raw format and as HTML test reports, where applicable. Currently, there are no plans to change the build process, aside from minor parameter modifications or added features such as release wheel publishing to tt-forge.

Testing

Testing is performed inside the build-and-test.yml workflow as run-tests jobs. These use a matrix strategy, which means that multiple jobs are created from the same job definition and executed on multiple machines.

Test Matrix

Each row in the matrix array represents one test that will execute on a specific machine. Example:

 {runs-on: n150,   name: "run",  suite: "runtime_debug",     image: "tracy",  type: "ttrt",    path: "Silicon", flags: "--non-zero", container-options: "--device /dev/tenstorrent/0"},
 {runs-on: llmbox, name: "perf", suite: "perf",              image: "tracy",  type: "ttrt",    path: "Silicon/TTNN/llmbox/perf", container-options: "--device /dev/tenstorrent/0 --device /dev/tenstorrent/1 --device /dev/tenstorrent/2 --device /dev/tenstorrent/3"},
 {runs-on: n150,   name: "perf", suite: "explorer",          image: "tracy",  type: "pytest",  path: "tools/explorer/test/run_tests.py", container-options: "--device /dev/tenstorrent/0"},

runs-on

Specify the machine on which the test suite will be executed. Currently supported runners are:

  • N150
  • N300
  • NXX0 - either N150 or N300
  • llmbox
  • tg - galaxy box

It is expected that this list will expand soon as machines from the Blackhole chip family are added to the runner pool.

name

"name" has historic origins in its name. In reality it is the type of test to perform:

  • run - perform a functional run, or just run the tests
  • perf - collect performance data (and send it to analytics)

path

This field represents the path inside the tt-mlir repository where your tests reside. For ttrt tests, this is the path of the generated MLIR files relative to the build/test/ttmlir directory. For pytest, the path is relative to the repository root.

suite

This is the actual test name.

image

Specify which release build image to use. It can be:

  • speedy
  • tracy

Please take a look at the Builds section for a more detailed description of the builds.

type

Specify the type of test run. Currently supported:

  • pytest - run python tests using pytest
  • ttrt - run tests using ttrt tool
  • unit - run unit tests
  • builder - run builder tests and execute the generated flatbuffers if (and only if) the run-ttrt flag is set
  • ttnn_standalone - run ttnn_standalone sanity test
  • pykernel - run pykernel tests and runtime demo.

flags (optional)

Additional flags may be used when running tests. These are passed to ttrt or pytest as an additional parameter.

container-options (optional)

Each test runs in a Docker container, and this option specifies the Docker container options. It is mostly used to map a TT hardware device into the container (for example: "--device /dev/tenstorrent/0"). If no value is passed, the default value is used ("--device /dev/tenstorrent").

Adding New Test

Usually, it is enough to add a single line to the test matrix and your tests will become part of tt-mlir CI. Here is a checklist of what you should decide before adding it:

  • On which TT hardware should your tests run? Put the specific hardware in the "runs-on" field, or NXX0 if you don't care. If you want your test to run on multiple hardware types, add multiple lines to the matrix, one for each hardware type.
  • Are your tests run with ttrt or pytest? Put this decision in the "type" field.
  • Does your test generate a performance report? If it does, set the name to "perf"; if not, set it to "run".
  • Use creativity and name your test. Write the result of your hard intellectual work in the "suite" field.

Each line in the matrix MUST be unique! Check that yours is. If it is not, use more of your creative and intellectual energy to come up with a better (or at least different) name for the "suite" field.
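
For example, a hypothetical new entry for a ttrt functional test might look like the line below; the suite name and path are placeholders, not real tests.

 {runs-on: n150,   name: "run",  suite: "my_new_suite",       image: "speedy", type: "ttrt",    path: "Silicon/TTNN/n150/my_new_suite", container-options: "--device /dev/tenstorrent/0"},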

Consider

Here are a few things to consider:

  • Design your ttrt test so that it is generated by the check-ttmlir CMake target.
  • For pytest, use pytest test discovery to run all tests in subdirectories. In most cases there is no need for two sets of tests.
  • If you want to have separate test reports, do not add additional XML file paths and steps to upload these. Use test_report_path because it will be automatically picked up and sent to analytics.
  • If separate reports are required, treat them as different tests. Add an additional line to the test matrix.
  • If you need to add additional steps to the run-tests job, make sure it's necessary. Typically, it's not a good idea to add additional steps. If there's another way to achieve your goal, use that method instead. This is because each step is executed for each test in the test matrix. When you add additional steps your test might pass, but many other tests will randomly fail.

Additional Reading

This section contains pointers to reading material that may be useful for understanding the project.

MLIR

  • https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index.html
  • https://mlir.llvm.org/docs/Tutorials/Toy/
  • https://www.jeremykun.com/2023/08/10/mlir-getting-started/
  • https://arxiv.org/pdf/2002.11054
  • https://ieeexplore.ieee.org/abstract/document/9370308

Dialects

Tablegen

LLVM Testing Framework Tools

Jax
Flatbuffer
Openxla Website
openxla
StableHLO

Contributor Covenant Code of Conduct

Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

Our Standards

Examples of behavior that contributes to a positive environment for our community include:

  • Demonstrating empathy and kindness toward other people
  • Being respectful of differing opinions, viewpoints, and experiences
  • Giving and gracefully accepting constructive feedback
  • Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
  • Focusing on what is best not just for us as individuals, but for the overall community

Examples of unacceptable behavior include:

  • The use of sexualized language or imagery, and sexual attention or advances of any kind
  • Trolling, insulting or derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others' private information, such as a physical or email address, without their explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

Scope

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at nsmith@tenstorrent.com or staylor@tenstorrent.com. All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.

Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:

1. Correction

Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.

Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.

2. Warning

Community Impact: A violation through a single incident or series of actions.

Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.

3. Temporary Ban

Community Impact: A serious violation of community standards, including sustained inappropriate behavior.

Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.

4. Permanent Ban

Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.

Consequence: A permanent ban from any sort of public interaction within the community.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by Mozilla's code of conduct enforcement ladder.

For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.

Project Structure

  • env: Contains the environment setup for building project dependencies, such as LLVM and Flatbuffers
  • include/ttmlir: Public headers for the TTMLIR library
    • Dialect: MLIR dialect interfaces and definitions, dialects typically follow a common directory tree structure:
      • IR: MLIR operation/type/attribute interfaces and definitions
      • Passes.[h|td]: MLIR pass interfaces and definitions
      • Transforms: Common MLIR transformations, typically invoked by passes
    • Target: Flatbuffer schema definitions. This defines the binary interface between the compiler and the runtime
  • lib: TTMLIR library implementation
    • CAPI: C API for interfacing with the TTMLIR library, note this is needed for implementing the python bindings. Read more about it here: https://mlir.llvm.org/docs/Bindings/Python/#use-the-c-api
    • Dialect: MLIR dialect implementations
  • runtime: Device runtime implementation
    • include/tt/runtime: Public headers for the runtime interface
    • lib: Runtime implementation
    • tools/python: Python bindings for the runtime, currently this is where ttrt is implemented
  • test: Test suite
  • tools/ttmlir-opt: TTMLIR optimizer driver

Namespaces

  • mlir: On the compiler side, we use the MLIR namespace for all MLIR types and operations and subnamespace for our dialects.
    • mlir::tt: Everything ttmlir related is underneath this namespace. Since we need to subnamespace under mlir, just mlir::tt seemed better than mlir::ttmlir which feels redundant.
      • mlir::tt::ttir: The TTIR dialect namespace
      • mlir::tt::ttnn: The TTNN dialect namespace
      • mlir::tt::ttmetal: The TTMetal dialect namespace
      • mlir::tt::ttkernel: The TTKernel dialect namespace
  • tt::runtime: On the runtime side, we use the tt::runtime namespace for all runtime types and operations.
    • tt::runtime::ttnn: The TTNN runtime namespace
    • tt::runtime::ttmetal: The TTMetal runtime namespace (not implemented)

Dialects Overview

Here is a brief overview of the dialects in the project; please refer to the individual dialect documentation for more details:

  • tt: Common types such as tt.tile, tt.metal_layout, tt.grid, etc., and enums such as data formats, memory spaces, iterator types, etc.
  • ttir: A high level dialect that models the tensor compute graph on Tenstorrent devices. Accepts tosa and linalg input.
    • ttir.generic: Generically describe compute work.
    • ttir.to_layout: Convert between different tensor memory layouts and transfer between different memory spaces.
    • tensor.pad: Pad a tensor with a value (e.g. for convs)
    • ttir.yield: Return the result memref of the computation in the dispatch region body; lowers to ttkernel.yield.
    • ttir.kernel: Lowers to some backend kernel.
  • ttnn: A TTNN dialect that models ttnn API.
  • ttkernel: Tenstorrent kernel library operations.
    • ttkernel.noc_async_read
    • ttkernel.noc_async_write
    • ttkernel.cb_push_back
    • ttkernel.[matmul|add|multiply]: Computations on tiles in source register space, storing the result in dest register space.
    • ttkernel.sfpu_*: Computations on tiles in dest register space using the SFPU coprocessor.
  • ttmetal: Operations that dispatch work from host to device.
    • ttmetal.enqueue_program: Dispatch a grid of compute work.

Guidelines

This page contains a collection of guidelines to help maintain consistency and quality across our project. Please refer to the following documents for detailed instructions on coding practices, as well as specific dialect guidelines.

TT-MLIR Coding Guidelines

This document outlines the coding standards used in the tt-mlir project. These guidelines are designed to enhance the readability and maintainability of our shared codebase. While these guidelines are not strict rules for every situation, they are essential for maintaining consistency across the repository.

Our long-term aim is to have the entire codebase adhere to these conventions.

Since our compiler is built on the LLVM MLIR framework, we strive to align closely with the LLVM coding style guidelines outlined here: LLVM Coding Standards.

Naming

Clear and descriptive names are crucial for code readability and preventing bugs. It’s important to choose names that accurately reflect the semantics and purpose of the underlying entities, within reason. Avoid abbreviations unless they are widely recognized. Once you settle on a name, ensure consistent capitalization throughout the codebase to avoid confusion.

The general naming rule is to use camel case for most names (for example, WorkaroundPass, isRankedTensor()). A short example putting these rules together follows the list below.

  • Type Names
    • Applies to classes, structs, enums, and typedefs.
    • Should be nouns that describe the entity's purpose.
    • Use upper camel case (for example, TTNNOptimizerOptions, DecompositionPass).
  • Variable Names
    • Should be nouns, as they represent state.
    • Use lower camel case (for example, inputLayout).
  • Function Names
    • Represent actions and should be verb phrases.
    • Use lower camel case (for example, createTTNNOptimizer(), emitTTNNAsCpp()).
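
Putting these rules together, a short sketch (with invented names) looks like this:

// Invented names, purely to illustrate the casing rules above.
class TTNNWorkaroundPass {            // type name: noun, upper camel case
public:
  void runOnOperation();              // function name: verb phrase, lower camel case

private:
  unsigned numRewrittenOps = 0;       // variable name: noun, lower camel case
};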

Includes

We prefer #includes to be listed in this order:

  1. Main Module Header
  2. Local/Private Headers
  3. LLVM project/subproject headers (clang/..., lldb/..., llvm/..., etc)
  4. System #includes

Each category should:

  • Be sorted lexicographically by the full path.
  • Be separated by a single blank line for clarity.

Only the standard lib header includes should use <> whereas all the others should use quotes "". Additionally, all project headers must use absolute paths (rooted at ttmlir) to prevent preprocessor and namespacing issues. For example, the following is preferred:

#include "ttmlir/module/something.h"

over:

#include "something.h"

Using TTIRToTTNN.cpp as an example, this is what includes would look like for us:

#include "ttmlir/Conversion/TTIRToTTNN/TTIRToTTNN.h"  # main header

#include "ttmlir/Dialect/TT/IR/TTOpsTypes.h"  # these are local/private headers
#include "ttmlir/Dialect/TTNN/Utils/Utils.h"

#include "mlir/Dialect/MemRef/IR/MemRef.h"  # llvm project/subproj headers
#include "llvm/Support/LogicalResult.h"

#include <cstdio>  # system includes
#include <algorithm>

Comments

Write comments as full sentences, starting with a capital letter and ending with a period. Comments should explain why the code exists, not just what it does. Use comments to clarify logic, assumptions, or any non-obvious aspects of the code.

Example of a comment:

// Initialize the buffer to store incoming data from the network.

In general, C++ style comments (//) should be used. Use C-style comments (/**/) only when documenting the significance of constants used as actual parameters in a call:

object.callFunction(/*arg0=*/nullptr);

Every function, class, or non-trivial piece of logic should have a comment. Avoid redundant comments for self-explanatory code, but never leave complex code unexplained. Example of redundant comment:

// Increment the counter by 1.  // Redundant, avoid.
counter++;

Ensure comments are accurate and reflect the current state of the code. Outdated or misleading comments can be worse than no comments at all.

Code Denesting (Inversion)

Strive to minimize unnecessary indentation without compromising code clarity. One effective way to achieve this is by using early exits and the continue keyword in long loops.

Consider the following example:

void doSomething(Operation *op)
{
    if (op->getNumOperands() > 0
        && isDpsOp(op)
        && doSomethingDifferent(op))
    {
        // ... some long code ...
    }
}

It is strongly recommended to format the code as follows:

void doSomething(Operation *op)
{
    // ...
    // We need to do something with the op that has more than 0 operands
    if (op->getNumOperands() <= 0) return;

    // We need something to do with the DPS op
    if (!isDpsOp(op)) return;

    // Just for example purposes
    if (!doSomethingDifferent(op)) return;

    // .. some long code ...
}

This reduces nesting, makes the reasoning behind the conditions clearer, and signals to the reader that there is no subsequent else to worry about, reducing cognitive load. This can significantly improve code readability and comprehension.

Function Declaration and Definition Order

To improve code readability and maintainability, we should adopt a consistent approach for organizing function declarations and definitions within a file. The goal is to make it easier for readers to follow the logical flow of function dependencies.

Follow a bottom-up call order:

  • Arrange functions so that lower-level helper functions are defined first, followed by higher-level functions that call them.
  • This allows each function to be defined after its dependencies, making it clear which functions rely on which.
  • For example, if function A calls A1 and A2, then the preferred order is:
void A1();
void A2();
void A(){
  A1();
  A2();
}

Group related functions together:

  • If functions are only relevant to a specific “parent” function (for example, A1 and A2 are only called by A), place them directly before the “parent” function.
  • If a function (like A2) is also called by other functions (for example, B), place it where it fits the overall bottom-up order.

Avoid mixed ordering:

  • Mixing top-down and bottom-up call orders within the same file can make the code hard to read and maintain.

Example of a preferred order:

void A1() {
  /*...*/
}
void A2() {
  /*...*/
}
void B() {
  A2(); // A2 is defined before B, so dependencies are clear.
}
void A() {
  A1();
  A2();
  B();
}

Helper Functions

These coding guidelines address visibility and linkage of simple helper functions to ensure clarity, prevent linking errors, and improve maintainability:

  • If a helper function needs to be defined in a .cpp file, it should be declared static or wrapped inside an anonymous namespace.

  • If a helper function needs to be defined in a header file (for example, for templated or performance-critical code), it should be marked as inline.
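
A small sketch of both conventions (the function names are invented for illustration):

// --- SomePass.cpp ---
// Helper local to this translation unit: declaring it static (or placing it in an
// anonymous namespace) gives it internal linkage, so it cannot collide with a
// same-named symbol in another .cpp file.
static bool isEligibleForRewrite(int numOperands) { return numOperands > 0; }

// --- SomeUtils.h ---
// Helper defined in a header: marking it inline prevents duplicate-symbol linker
// errors when multiple translation units include the header.
inline int roundUpToTile(int value, int tile = 32) {
  return ((value + tile - 1) / tile) * tile;
}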

[!NOTE] A significant concern with declaring functions as non-public (for example, static functions or functions in unnamed namespaces) is that they cannot be unit tested in isolation. This limitation hinders our ability to write focused, granular tests that verify the correctness of individual components and it also reduces test coverage.

Using Namespaces

Namespaces are an important part of C++ programming, providing a way to organize code and avoid naming conflicts. Choose namespace names that reflect the purpose or functionality of the code contained within.

Follow these guidelines when defining namespaces:

  • Use lower-case letters for short, single-word names or those with a clear acronym (for example, ttnn, mlir).
  • Use nested namespaces to group logically related code, avoiding too deep or unnecessarily complex hierarchy

Follow these guidelines when using namespaces:

  • Do not use a using-directive to make all names from a namespace available because it pollutes the namespace.
// Forbidden -- This pollutes the namespace.
using namespace std;
  • Avoid placing code in the global namespace to reduce the potential for name conflicts and ambiguity. Always use specific namespaces. If necessary to use something from the global namespace (such as std), use an explicit std:: prefix rather than importing everything using using namespace std;.
  • Do not use namespace aliases at namespace scope in header files except in explicitly marked internal-only namespaces, because anything imported into a namespace in a header file becomes part of the public API exported by that file.
  • Try to avoid mixing concepts from different namespaces in a single function or class. If a function belongs to one namespace but calls classes from others, ensure the relationships are clear.
  • Wrap classes/structs declared in .cpp files inside of an anonymous namespace to avoid violating ODR. See LLVM docs for more detailed information.

Using Alternative Tokens (and, or, xor, etc.)

Although these tokens are standard, we should avoid them. They are very rarely used in practice, and the C++ community overwhelmingly uses the symbolic operators (&&, ||, !, etc.), which are more familiar and immediately recognizable to most C++ developers. Using the alternative tokens can make code harder to read and maintain, especially for developers who are not used to them, so stick to the symbolic operators for clarity, consistency, and compatibility with other C++ developers and tools.

Type Aliasing

When declaring type aliases in C++ prefer using over typedef. using provides better readability, especially for complex types, and supports alias templates. Here is an example:

// Preferred
using Callback = void(*)(int, double);

// Avoid
typedef void (*Callback)(int, double);

Choose alias names that clarify their role in the code. Avoid overly generic names that might obscure the type’s purpose. Do not create a type alias unless it significantly improves clarity or simplifies complex types.

Using auto to Deduce Type

Use auto only when it enhances code readability or maintainability. Avoid defaulting to “always use auto.” Instead, apply it thoughtfully in the following scenarios:

  • When the type is immediately clear from the initializer, such as with cast<...>(...).
  • When the type is obvious from the context, making the code cleaner and more concise.
  • When the type is already abstracted, such as with container typedefs like std::vector<T>::iterator.

In all other cases, prefer explicit type declarations to maintain clarity and ensure the code remains easy to understand.
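
For instance (types and names invented for illustration):

#include <string>
#include <vector>

void example(const std::vector<std::string> &names) {
  // Good: iterator typedefs are already an abstraction; auto keeps the loop readable.
  for (auto it = names.begin(); it != names.end(); ++it) {
    // Avoid: nothing on this line reveals the type, so spell it out instead of auto.
    const std::string &name = *it;
    (void)name;
  }
  // Good (shown as a comment because it needs MLIR headers): the type is named in
  // the initializer, so auto hides nothing:
  //   auto memrefTy = mlir::cast<mlir::MemRefType>(type);
}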

Python Coding Guidelines

Python Version and Source Code Formatting

The minimum required version of Python is currently 3.10. Python code in the tt-mlir repository should only use language features available in this version of Python.

The Python code within the tt-mlir repository should adhere to the formatting guidelines outlined in PEP 8.

For consistency and to limit churn, code should be automatically formatted with the black utility, which is PEP 8 compliant. Use its default rules. For example, avoid specifying --line-length even though it does not default to 80. The default rules can change between major versions of black. In order to avoid unnecessary churn in the formatting rules, we currently use black version 23.x.

When contributing a patch unrelated to formatting, you should format only the Python code that the patch modifies. When contributing a patch specifically for reformatting Python files, use black, which currently only supports formatting entire files.

Here is a quick example, but see the black documentation for details:

$ black test.py                    # format entire file

TTNN Dialect Contribution Guidelines

This document provides clear and consistent guidelines for contributing to the TTNN dialect, including operations, attributes, types, and other components. Following these ensures a streamlined development process, faster code reviews, and higher-quality code with fewer bugs.

General Principle: Model TTNN Library Closely

The TTNN dialect should closely reflect the TTNN library wherever practical, serving as the core guiding principle when contributing to the dialect. Whenever there's a need to deviate from this principle, it should be discussed with stakeholders.

Ops and Operands

Signature Selection

Ops in TTNN may have multiple signatures available - it's important to choose the right one when creating its model in the TTNN dialect. Going through an example, these are the available signatures for the ttnn::transpose op:

struct ExecuteTranspose {
    static ttnn::Tensor invoke(
        uint8_t queue_id,
        const ttnn::Tensor& input_tensor,
        const int64_t& dim1,
        const int64_t& dim2,
        const std::optional<MemoryConfig>& memory_config_arg,
        const std::optional<float>& pad_value = 0.0f);

    static ttnn::Tensor invoke(
        const ttnn::Tensor& input_tensor,
        const int64_t& dim1,
        const int64_t& dim2,
        const std::optional<MemoryConfig>& memory_config,
        const std::optional<float>& pad_value = 0.0f);

    static ttnn::Tensor invoke(
        const ttnn::Tensor& input_tensor,
        const int64_t& dim1,
        const int64_t& dim2,
        const std::optional<float>& pad_value = 0.0f);
};

The first and second signature differ only in the queue_id parameter - we don't model queues today, so the second signature has priority here. The second and third signature differ in memory_config parameter - the second signature is preferred as it is more robust: the parameter is optional so it can remain unused if it isn't needed.

Only one signature should be chosen. If the need would arise for more than one signature, it would be a precedent, and should be discussed with stakeholders.

Operand ordering

Operands in the TTNN dialect ops should match the ordering of the signature of the op being modelled. For the chosen signature of the ttnn::transpose op, the operands should look like this:

let arguments = (ins AnyRankedTensor:$input,
                     SI64Attr:$dim0,
                     SI64Attr:$dim1,
                     OptionalAttr<TTNN_MemoryConfigAttr>:$memory_config,
                     OptionalAttr<FloatAttr>:$pad_value);

Mixing types and attributes within the ordering is not an issue; this is valid:

let arguments = (ins TTNN_ShapeAttr:$shape,
                     OptionalAttr<TT_DataTypeAttr>:$dtype,
                     OptionalAttr<TTNN_LayoutAttr>:$layout,
                     Optional<TT_Device>:$device,
                     OptionalAttr<TTNN_MemoryConfigAttr>:$memory_config);

Following this guideline provides consistency with the TTNN lib.

Optional operands

If an operand is optional in the TTNN lib, it should be modelled as optional in the dialect.

Default-valued operands

If an operand has a default value in the TTNN lib, it should have a default value in the dialect.

ttnn::permute as an example:

static ttnn::Tensor invoke(
    const ttnn::Tensor& input_tensor,
    tt::stl::Span<const int64_t> dims,
    const std::optional<MemoryConfig>& memory_config,
    const std::optional<float>& pad_value = 0.0f);
let arguments = (ins AnyRankedTensor:$input,
                     DenseI64ArrayAttr:$permutation,
                     OptionalAttr<TTNN_MemoryConfigAttr>:$memory_config,
                     DefaultValuedOptionalAttr<F32Attr, "0.0f">:$pad_value);

Numerical operands

Numerical operands should match in signedness and bit width. If an operand is a signed integer of width of 32 bits, SI32Attr should be used to model it.

Pointers and references

Pointers and references should be ignored. We do not want to model this level of detail at this point in time.

There were very few issues with these previously, and they were caused by inconsistencies in TTNN lib APIs.

Attrs vs Types

General guideline is that if a value is known at compile time, it should probably be an Attr. Example: dims in transpose op, pooling windows in a conv, etc. If the value is unknown at compile time (e.g. tensor) it should be a Type.

There's another consideration to account for: does the value need its own SSA? Remember, Attrs need something to latch onto, like an op or a Type, but Types need to be constructed, i.e. have their own SSA, in order to exist. Let's look at ttnn::Shape for example - in TTNN lib, these need to be constructed, so it naturally follows that they should have their own SSA value within the IR, implying that they should be implemented as Types. However, there are several downsides to this:

  • More IR is produced
  • Diminished readability as they're not attached to the object whose shape they're describing
  • Not as easy to construct in code
  • Runtime would need to keep track of all the Shape objects (it currently maps all SSAs, which are currently only tensors and devices)

One upside for implementing ttnn::Shape as a Type is that it would enable optimizing out multiple constructor calls for the same Shape.

It is agreed that we should prefer using Attrs in these scenarios. However, this guideline is not set in stone - stakeholders should be notified if anyone believes there's a need to implement an object as a Type.

Destination-passing style (DPS)

If the op in the TTNN lib has a destination tensor, it should be modelled as a DPS op.

An example signature, where the last operand is a destination tensor:

static Tensor invoke(
    const Tensor& input_tensor,
    float exponent,
    const std::optional<MemoryConfig>& memory_config = std::nullopt,
    const std::optional<Tensor>& optional_output_tensor = std::nullopt);

Variadic operands

The Variadic<> type constraint should only be used for operands that are variadic in nature, e.g. a vector of tensors, like in ttnn::concat (a sketch of the corresponding ODS arguments follows the signature below):

static ttnn::Tensor invoke(
    const std::vector<ttnn::Tensor>& input_tensors,
    int dim,
    const std::optional<MemoryConfig>& memory_config = std::nullopt,
    const std::optional<ttnn::Tensor>& optional_output_tensor = std::nullopt,
    unsigned int groups = 1);
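
A rough sketch of how that variadic tensor list could appear in ODS is shown below; this is illustrative only and the actual ttnn.concat definition may differ.

let arguments = (ins Variadic<AnyRankedTensor>:$inputs,
                     SI32Attr:$dim,
                     OptionalAttr<TTNN_MemoryConfigAttr>:$memory_config);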

Operand naming

Operands should be named as they are in the TTNN lib. However, this guideline is not strict, and some reasonable deviations are acceptable.

Operand namespaces

Some operands are defined in a namespace nested within the TTNN namespace, i.e. ttnn::ccl::Topology, and some are in other but related namespaces, i.e. tt::tt_metal::MemoryConfig. While it would be ideal to model these completely accurately, it doesn’t provide value and we should pretend they’re all in the ttnn:: namespace for the sake of simplicity.

Adding an Op

This guide will walk you through the process of adding a new Op end to end in tt-mlir; in this case we will be adding a matmul operation. Note that the matmul op was added as part of the same changeset as this guide, so it could be useful to reference the diff alongside this guide to see the changes in full.

This guide will cover the following steps:

1. Define the Op in the TTIR frontend dialect

We will start by defining the Op in the TTIR dialect. The TTIR Ops are defined in a tablegen file located at include/ttmlir/Dialect/TTIR/IR/TTIROps.td.

Tablegen is a domain-specific language for defining ops/types/attributes in MLIR and LLVM; these definitions constitute the dialect's Operation Definition Specification (ODS).

Here is an example of defining matmul in the TTIR dialect:

def TTIR_MatmulOp : TTIR_NamedOp<"matmul"> {
    let summary = "Matrix multiplication operation.";
    let description = [{
      The `matmul` operation computes the matrix multiplication of two tensors.

      This operation performs matrix multiplication between tensors `a` and `b`. It supports optional
      transposition of either input tensor before multiplication. For 2D tensors, this computes the standard
      matrix product. For tensors with more dimensions, it applies batched matrix multiplication.

      Example:
      ```mlir
      // Basic matrix multiplication of 2D tensors
      %a = ... : tensor<3x4xf32>  // Matrix A with shape [3,4]
      %b = ... : tensor<4x5xf32>  // Matrix B with shape [4,5]
      %output = ttir.empty() : tensor<3x5xf32>  // Output matrix shape
      %result = ttir.matmul(%a, %b, %output) :
          tensor<3x4xf32>, tensor<4x5xf32>, tensor<3x5xf32> -> tensor<3x5xf32>

      // Batched matrix multiplication with transposition
      %a = ... : tensor<2x3x4xf32>  // Batch of 2 matrices with shape [3,4]
      %b = ... : tensor<2x5x4xf32>  // Batch of 2 matrices with shape [5,4]
      %output = ttir.empty() : tensor<2x3x5xf32>  // Output shape
      %result = ttir.matmul(%a, %b, %output) {
          transpose_a = false,  // Don't transpose A
          transpose_b = true    // Transpose B before multiplication
      } : tensor<2x3x4xf32>, tensor<2x5x4xf32>, tensor<2x3x5xf32> -> tensor<2x3x5xf32>
      ```

      Inputs:
      - `a` (Tensor): The first input tensor.
      - `b` (Tensor): The second input tensor.

      Attributes:
      - `transpose_a` (Boolean, default=false): Whether to transpose tensor `a` before multiplication.
      - `transpose_b` (Boolean, default=false): Whether to transpose tensor `b` before multiplication.

      Outputs:
      - `result` (Tensor): The result of the matrix multiplication.

      Note: The inner dimensions of the input tensors must be compatible for matrix multiplication.
      If `a` has shape [..., m, k] and `b` has shape [..., k, n], then the result will have shape [..., m, n].
      If `transpose_a` is true, then `a` is treated as having shape [..., k, m].
      If `transpose_b` is true, then `b` is treated as having shape [..., n, k].
    }];

    let arguments = (ins AnyRankedTensor:$a,
                         AnyRankedTensor:$b,
                         AnyRankedTensor:$output,
                         DefaultValuedAttr<BoolAttr, "false">:$transpose_a,
                         DefaultValuedAttr<BoolAttr, "false">:$transpose_b);

    let results = (outs AnyRankedTensor:$result);

    let hasVerifier = 1;

    let hasCanonicalizer = 1;
}

There are many things to break down here, starting from the top:

  • def in tablegen is used to define a concrete type; this will have a 1-1 mapping to a generated C++ class, and for this particular case the build will end up generating the file build/include/ttmlir/Dialect/TTIR/IR/TTIROps.h.inc.
  • It inherits from class TTIR_DPSOp; classes in tablegen don't define a concrete type, but rather an interface that augments or constrains inheriting defs. TTIR_DPSOp is a class that defines the common attributes for all TTIR Ops that implement Destination Passing Style (DPS) semantics. DPS just means that the result tensor is passed as an argument to the operation, which will be critical for modeling buffer allocation / lifetimes. Note the 3rd argument AnyRankedTensor:$output.
  • Next we have a list of arguments. These arguments consist of a mixture of Types (i.e. AnyRankedTensor) and Attributes. Read more about Types & Attributes here.
    • AnyRankedTensor is part of a tablegen standard library which type aliases to MLIR's builtin Tensor type, with the added constraint that the tensor has a static rank. As much as possible we want to use the builtin types and infrastructure provided by MLIR.
  • Next we have a list of results, in this case just one, which aliases the output tensor. One drawback of DPS is that the result tensor and the output tensor will appear to have different SSA names in the IR, but they really alias the same object. This can make writing some passes more cumbersome.
  • Next we have extraClassDeclaration, which enables us to inject member functions, written directly in C++, into the generated class. We are doing this for this particular case in order to satisfy the DPS interface which requires an implementation for getting the mutated output tensor.
  • Finally, we have hasVerifier = 1; this tells MLIR that we have a verifier function that will be called to validate the operation. This is good practice to ensure that the IR is well formed.

We can now try building and opening the TTIROps.h.inc file to see the generated C++ code. We will actually get a linker error, because hasVerifier = 1 automatically declared a verifier function that we still need to implement.

Let's head over to lib/Dialect/TTIR/IR/TTIROps.cpp and implement the verifier.

// MatmulOp verification
::mlir::LogicalResult mlir::tt::ttir::MatmulOp::verify() {
  ::mlir::RankedTensorType inputAType = getA().getType();
  ::mlir::RankedTensorType inputBType = getB().getType();
  ::mlir::RankedTensorType outputType = getOutput().getType();

  llvm::ArrayRef<int64_t> outputShape = outputType.getShape();
  llvm::SmallVector<int64_t> inputAShape(inputAType.getShape());
  llvm::SmallVector<int64_t> inputBShape(inputBType.getShape());

  // Verify that the input A is at least 1D tensor.
  if (inputAType.getRank() < 1) {
    return emitOpError("Input A must be at least a 1D tensor");
  }

  // Verify that the input B is at least 1D tensor.
  if (inputBType.getRank() < 1) {
    return emitOpError("Input B must be at least a 1D tensor");
  }

  // If input A is a vector (1D tensor), 1 is prepended to its dimensions for
  // the purpose of the matrix multiplication. After the matrix multiplication,
  // the prepended dimension is removed. Otherwise, check if the LHS needs to be
  // transposed.
  if (inputAType.getRank() == 1) {
    inputAShape.insert(inputAShape.begin(), 1);
  } else if (getTransposeA()) {
    std::swap(inputAShape[inputAShape.size() - 1],
              inputAShape[inputAShape.size() - 2]);
  }

  // If input B is a vector (1D tensor), a 1 is appended to its dimensions for
  // the purpose of the matrix-vector product and removed afterwards. Otherwise,
  // check if the RHS needs to be transposed.
  if (inputBType.getRank() == 1) {
    inputBShape.push_back(1);
  } else if (getTransposeB()) {
    std::swap(inputBShape[inputBShape.size() - 1],
              inputBShape[inputBShape.size() - 2]);
  }

  // Verify that the input A and input B has matching inner dimensions.
  if (inputAShape[inputAShape.size() - 1] !=
      inputBShape[inputBShape.size() - 2]) {
    return emitOpError("Input A[-1](")
           << inputAShape[inputAShape.size() - 1] << ") and B[-2]("
           << inputBShape[inputBShape.size() - 2]
           << ") must have matching inner dimensions";
  }

  llvm::SmallVector<int64_t> expectedOutputShape;
  // Verify that the batch dimensions are broadcast compatible and construct the
  // expected output shape. If either of input A or input B is at most 2D
  // tensors, the batch dimensions are trivially broadcast compatible.
  if (inputAShape.size() > 2 || inputBShape.size() > 2) {
    llvm::SmallVector<int64_t> inputABatchDims(inputAShape.begin(),
                                               inputAShape.end() - 2);
    llvm::SmallVector<int64_t> inputBBatchDims(inputBShape.begin(),
                                               inputBShape.end() - 2);

    // Verify that the batch dimensions of input A and B are broadcast
    // compatible.
    llvm::SmallVector<int64_t, 4> broadcastedShape;
    if (!mlir::OpTrait::util::getBroadcastedShape(
            inputABatchDims, inputBBatchDims, broadcastedShape)) {

      return emitOpError("Batch dimensions of input A(" +
                         ttmlir::utils::join(inputABatchDims, ",") +
                         ") and B(" +
                         ttmlir::utils::join(inputBBatchDims, ",") +
                         ") are not broadcast compatible");
    }

    // Insert the broadcasted batch dimensions in the expected output shape.
    expectedOutputShape = std::move(broadcastedShape);
  }

  // Insert the input A and B inner dimensions in expected output shape
  // Consider the case where input A and B are vectors. In that case,
  // the dimension 1 is omitted from the output shape.
  if (inputAType.getRank() > 1) {
    expectedOutputShape.push_back(inputAShape[inputAShape.size() - 2]);
  }

  if (inputBType.getRank() > 1) {
    expectedOutputShape.push_back(inputBShape[inputBShape.size() - 1]);
  }

  // Check the case of a vector-vector product. At this moment we don't support
  // scalars in IR, hence check that the output is at least 1D tensor of size 1.
  if (expectedOutputShape.size() == 0) {
    if (outputType.getRank() < 1) {
      return emitOpError("Scalar output is not supported, output must be at "
                         "least a 1D tensor");
    }

    if (outputType.getRank() > 1 || outputType.getShape()[0] != 1) {
      return emitOpError("Scalar output must be a 1D tensor of size 1");
    }

    return success();
  }

  // Verify that the output shape is correct.
  if (outputShape.size() != expectedOutputShape.size()) {
    return emitOpError("Output shape rank(")
           << outputShape.size()
           << ") must match the expected output shape rank("
           << expectedOutputShape.size() << ")";
  }

  // Verify each dim of the output shape.
  for (auto [index, outputDim, expectedDim] : llvm::zip(
           llvm::seq(outputShape.size()), outputShape, expectedOutputShape)) {
    if (outputDim != expectedDim) {
      return emitOpError("Output shape dimension[")
             << index << "](" << outputDim
             << ") doesn't match the expected output shape dimension[" << index
             << "](" << expectedDim << ")";
    }
  }

  return success();
}

2. Define the Op in the TTNN backend dialect

Next we will define the Op in the TTNN dialect. TTNN Ops are defined in the same way, but in their respective set of dialect files. Refer to the previous section for details; the process is the same.

TTNNOps.td

def TTNN_MatmulOp : TTNN_Op<"matmul",
      [DeclareOpInterfaceMethods<TTNN_OpModelInterface, ["getOpConstraints", "getOpRuntime"]>]
      > {
    let arguments = (ins AnyRankedTensor:$a,
                         AnyRankedTensor:$b,
                         DefaultValuedAttr<BoolAttr, "false">:$transpose_a,
                         DefaultValuedAttr<BoolAttr, "false">:$transpose_b,
                         OptionalAttr<AnyAttrOf<[
                            TTNN_MatmulMultiCoreReuseProgramConfigAttr,
                            TTNN_MatmulMultiCoreReuseMultiCastProgramConfigAttr,
                            TTNN_MatmulMultiCoreReuseMultiCast1DProgramConfigAttr,
                            TTNN_MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfigAttr
                         ]>>:$matmul_program_config);

    let results = (outs AnyRankedTensor:$result);

    let hasVerifier = 1;
}

TTNNOps.cpp

// MatmulOp verification
::mlir::LogicalResult mlir::tt::ttnn::MatmulOp::verify() {
  ::mlir::RankedTensorType inputAType = getA().getType();
  ::mlir::RankedTensorType inputBType = getB().getType();
  ::mlir::RankedTensorType outputType = getResult().getType();

  llvm::ArrayRef<int64_t> outputShape = outputType.getShape();
  llvm::SmallVector<int64_t> inputAShape(inputAType.getShape());
  llvm::SmallVector<int64_t> inputBShape(inputBType.getShape());

  // Verify that the input A is at least 1D tensor.
  if (inputAType.getRank() < 1) {
    return emitOpError("Input A must be at least a 1D tensor");
  }

  // Verify that the input B is at least 1D tensor.
  if (inputBType.getRank() < 1) {
    return emitOpError("Input B must be at least a 1D tensor");
  }

  // If input A is a vector (1D tensor), 1 is prepended to its dimensions for
  // the purpose of the matrix multiplication. After the matrix multiplication,
  // the prepended dimension is removed. Otherwise, check if the LHS needs to be
  // transposed.
  if (inputAType.getRank() == 1) {
    inputAShape.insert(inputAShape.begin(), 1);
  } else if (getTransposeA()) {
    std::swap(inputAShape[inputAShape.size() - 1],
              inputAShape[inputAShape.size() - 2]);
  }

  // If input B is a vector (1D tensor), a 1 is appended to its dimensions for
  // the purpose of the matrix-vector product and removed afterwards. Otherwise,
  // check if the RHS needs to be transposed.
  if (inputBType.getRank() == 1) {
    inputBShape.push_back(1);
  } else if (getTransposeB()) {
    std::swap(inputBShape[inputBShape.size() - 1],
              inputBShape[inputBShape.size() - 2]);
  }

  // Verify that the input A and input B has matching inner dimensions.
  if (inputAShape[inputAShape.size() - 1] !=
      inputBShape[inputBShape.size() - 2]) {
    return emitOpError("Input A[-1](")
           << inputAShape[inputAShape.size() - 1] << ") and B[-2]("
           << inputBShape[inputBShape.size() - 2]
           << ") must have matching inner dimensions";
  }

  llvm::SmallVector<int64_t> expectedOutputShape;
  // Verify that the batch dimensions are broadcast compatible and construct the
  // expected output shape. If either of input A or input B is at most 2D
  // tensors, the batch dimensions are trivially broadcast compatible.
  if (inputAShape.size() > 2 || inputBShape.size() > 2) {
    llvm::SmallVector<int64_t> inputABatchDims(inputAShape.begin(),
                                               inputAShape.end() - 2);
    llvm::SmallVector<int64_t> inputBBatchDims(inputBShape.begin(),
                                               inputBShape.end() - 2);

    // Verify that the batch dimensions of input A and B are broadcast
    // compatible.
    llvm::SmallVector<int64_t, 4> broadcastedShape;
    if (!OpTrait::util::getBroadcastedShape(inputABatchDims, inputBBatchDims,
                                            broadcastedShape)) {

      return emitOpError("Batch dimensions of input A(" +
                         ttmlir::utils::join(inputABatchDims, ",") +
                         ") and B(" +
                         ttmlir::utils::join(inputBBatchDims, ",") +
                         ") are not broadcast compatible");
    }

    // Insert the broadcasted batch dimensions in the expected output shape.
    expectedOutputShape = std::move(broadcastedShape);
  }

  // Insert the input A and B inner dimensions in expected output shape
  // Consider the case where input A and B are vectors. In that case,
  // the dimension 1 is omitted from the output shape.
  if (inputAType.getRank() > 1) {
    expectedOutputShape.push_back(inputAShape[inputAShape.size() - 2]);
  }

  if (inputBType.getRank() > 1) {
    expectedOutputShape.push_back(inputBShape[inputBShape.size() - 1]);
  }

  // Check the case of a vector-vector product. At this moment we don't support
  // scalars in IR, hence check that the output is at least 1D tensor of size 1.
  if (expectedOutputShape.size() == 0) {
    if (outputType.getRank() < 1) {
      return emitOpError("Scalar output is not supported, output must be at "
                         "least a 1D tensor");
    }

    if (outputType.getRank() > 1 || outputType.getShape()[0] != 1) {
      return emitOpError("Scalar output must be a 1D tensor of size 1");
    }

    return success();
  }

  // Verify that the output shape is correct.
  if (outputShape.size() != expectedOutputShape.size()) {
    return emitOpError("Output shape rank(")
           << outputShape.size()
           << ") must match the expected output shape rank("
           << expectedOutputShape.size() << ")";
  }

  // Verify each dim of the output shape.
  for (auto [index, outputDim, expectedDim] : llvm::zip(
           llvm::seq(outputShape.size()), outputShape, expectedOutputShape)) {
    if (outputDim != expectedDim) {
      return emitOpError("Output shape dimension[")
             << index << "](" << outputDim
             << ") doesn't match the expected output shape dimension[" << index
             << "](" << expectedDim << ")";
    }
  }

  return success();
}

For more details on adding ops to the TTNN dialect, refer to TTNN Dialect Contribution Guidelines.

3. Convert / Implement the Op in the TTNN passes

TTIR to TTNN

Next we will implement the conversion from the TTIR matmul Op to the TTNN matmul Op. This is a trivial conversion, as the Ops are identical in their semantics, so the changeset isn't going to be very instructive, but we will at least point to the files involved. The conversion is implemented in the ConvertTTIRToTTNNPass pass in file lib/Conversion/TTIRToTTNN/TTIRToTTNNPass.cpp.

Zooming into class ConvertTTIRToTTNNPass, we can see that we implement the pass interface via the member function void runOnOperation() final. This function will be called for every operation matching the type specified in the pass tablegen file. Taking a quick look at include/ttmlir/Conversion/Passes.td, we can see:

def ConvertTTIRToTTNN: Pass<"convert-ttir-to-ttnn", "::mlir::ModuleOp"> {

This means that runOnOperation will be called for every ModuleOp in the graph; usually there is only one ModuleOp, which serves as the root of the graph.

Inside runOnOperation is usually where we define a rewrite pattern set that can match much more complicated patterns (nested inside the ModuleOp's regions) than just a single operation. In the runOnOperation method you will see a call to populateTTIRToTTNNPatterns(...), which actually generates the rewrite patterns. populateTTIRToTTNNPatterns(...) is defined in lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp.

  patterns
      .add<TensorEmptyConversionPattern,
           NamedFullConversionPattern<ttir::ZerosOp, ttnn::ZerosOp>,
           NamedFullConversionPattern<ttir::OnesOp, ttnn::OnesOp>,
           FullOpConversionPattern,
           ToLayoutOpConversionPattern,
           QuantizationOpConversionPattern<ttir::QuantizeUnrolledOp, ttnn::QuantizeOp>,
           QuantizationOpConversionPattern<ttir::DequantizeUnrolledOp, ttnn::DequantizeOp>,
           RequantizeOpConversionPattern,
           ElementwiseOpConversionPattern<ttir::AbsOp, ttnn::AbsOp>,
           ElementwiseOpConversionPattern<ttir::AddOp, ttnn::AddOp>,
           ElementwiseOpConversionPattern<ttir::CbrtOp, ttnn::CbrtOp>,
           ElementwiseOpConversionPattern<ttir::FloorOp, ttnn::FloorOp>,
           ElementwiseOpConversionPattern<ttir::IsFiniteOp, ttnn::IsFiniteOp>,
           ElementwiseOpConversionPattern<ttir::LogicalAndOp, ttnn::LogicalAndOp>,
           ElementwiseOpConversionPattern<ttir::LogicalOrOp, ttnn::LogicalOrOp>,
           ElementwiseOpConversionPattern<ttir::LogicalNotOp, ttnn::LogicalNotOp>,
           ElementwiseOpConversionPattern<ttir::LogicalXorOp, ttnn::LogicalXorOp>,
           ElementwiseOpConversionPattern<ttir::BitwiseAndOp, ttnn::BitwiseAndOp>,
           ElementwiseOpConversionPattern<ttir::BitwiseOrOp, ttnn::BitwiseOrOp>,
           ElementwiseOpConversionPattern<ttir::BitwiseXorOp, ttnn::BitwiseXorOp>,
           ElementwiseOpConversionPattern<ttir::BitwiseNotOp, ttnn::BitwiseNotOp>,
           ElementwiseOpConversionPattern<ttir::MultiplyOp, ttnn::MultiplyOp>,
           ElementwiseOpConversionPattern<ttir::EqualOp, ttnn::EqualOp>,
           ElementwiseOpConversionPattern<ttir::NotEqualOp, ttnn::NotEqualOp>,
           ElementwiseOpConversionPattern<ttir::GreaterEqualOp, ttnn::GreaterEqualOp>,
           ElementwiseOpConversionPattern<ttir::GreaterThanOp, ttnn::GreaterThanOp>,
           ElementwiseOpConversionPattern<ttir::LessEqualOp, ttnn::LessEqualOp>,
           ElementwiseOpConversionPattern<ttir::LessThanOp, ttnn::LessThanOp>,
           ElementwiseOpConversionPattern<ttir::MaximumOp, ttnn::MaximumOp>,
           ElementwiseOpConversionPattern<ttir::MinimumOp, ttnn::MinimumOp>,
           ElementwiseOpConversionPattern<ttir::NegOp, ttnn::NegOp>,
           ElementwiseOpConversionPattern<ttir::ReluOp, ttnn::ReluOp>,
           ElementwiseOpConversionPattern<ttir::GeluOp, ttnn::GeluOp>,
           ElementwiseOpConversionPattern<ttir::SqrtOp, ttnn::SqrtOp>,
           ElementwiseOpConversionPattern<ttir::RsqrtOp, ttnn::RsqrtOp>,
           ElementwiseOpConversionPattern<ttir::SignOp, ttnn::SignOp>,
           ElementwiseOpConversionPattern<ttir::SigmoidOp, ttnn::SigmoidOp>,
           ElementwiseOpConversionPattern<ttir::Log1pOp, ttnn::Log1pOp>,
           ElementwiseOpConversionPattern<ttir::ReciprocalOp, ttnn::ReciprocalOp>,
           ElementwiseOpConversionPattern<ttir::ExpOp, ttnn::ExpOp>,
           ElementwiseOpConversionPattern<ttir::ErfOp, ttnn::ErfOp>,
           ElementwiseOpConversionPattern<ttir::ErfcOp, ttnn::ErfcOp>,
           ElementwiseOpConversionPattern<ttir::LogOp, ttnn::LogOp>,
           ElementwiseOpConversionPattern<ttir::DivOp, ttnn::DivideOp>,
           ElementwiseOpConversionPattern<ttir::CeilOp, ttnn::CeilOp>,
           ElementwiseOpConversionPattern<ttir::SinOp, ttnn::SinOp>,
           ElementwiseOpConversionPattern<ttir::CosOp, ttnn::CosOp>,
           ElementwiseOpConversionPattern<ttir::Expm1Op, ttnn::Expm1Op>,
           ElementwiseOpConversionPattern<ttir::RemainderOp, ttnn::RemainderOp>,
           ElementwiseOpConversionPattern<ttir::WhereOp, ttnn::WhereOp>,
           ElementwiseOpConversionPattern<ttir::TanOp, ttnn::TanOp>,
           ElementwiseOpConversionPattern<ttir::TanhOp, ttnn::TanhOp>,
           ElementwiseOpConversionPattern<ttir::AtanOp, ttnn::AtanOp>,
           ElementwiseOpConversionPattern<ttir::Atan2Op, ttnn::Atan2Op>,
           ElementwiseOpConversionPattern<ttir::PowOp, ttnn::PowOp>,
           Pooling2dOpConversionPattern<ttir::MaxPool2dOp, ttnn::MaxPool2dOp>,
           Pooling2dOpConversionPattern<ttir::AvgPool2dOp, ttnn::AvgPool2dOp>,
           ReductionOpConversionPattern<ttir::SumOp, ttnn::SumOp>,
           ReductionOpConversionPattern<ttir::MeanOp, ttnn::MeanOp>,
           ReductionOpConversionPattern<ttir::MaxOp, ttnn::MaxOp>,
           ReductionOpConversionPattern<ttir::MinOp, ttnn::MinOp>,
           ReductionProdOpConversionPattern,
           ReductionArgMaxOpConversionPattern,
           ElementwiseUnaryWithFloatParameterOpConversionPattern<ttir::LeakyReluOp, ttnn::LeakyReluOp>,
           BroadcastOpConversionPattern,
           PadOpConversionPattern,
           EmbeddingOpConversionPattern,
           EmbeddingBackwardOpConversionPattern,
           RepeatOpConversionPattern,
           CumSumOpConversionPattern,
           RepeatInterleaveOpConversionPattern,
           SoftmaxOpConversionPattern,
           TransposeOpConversionPattern,
           TypecastOpConversionPattern,
           ClampOpConversionPattern<ttir::ClampScalarOp, ttnn::ClampScalarOp>,
           ClampOpConversionPattern<ttir::ClampTensorOp, ttnn::ClampTensorOp>,
           ConcatOpConversionPattern,
           ReshapeOpConversionPattern,
           SliceOpConversionPattern,
           SqueezeOpConversionPattern,
           UnsqueezeOpConversionPattern,
           ConstantOpConversionPattern,
           LinearOpConversionPattern,
           BatchNormOpConversionPattern,
           MatmulOpConversionPattern,
           Conv2dOpConversionPattern,
           ConvTranspose2dOpConversionPattern,
           SubtractOpConversionPattern,
           MeshShardOpConversionPattern,
           AllReduceOpConversionPattern,
           AllGatherOpConversionPattern,
           ReduceScatterOpConversionPattern,
           CollectivePermuteOpConversionPattern,
           ArangeOpConversionPattern,
           UpdateCacheOpConversionPattern,
           FillCacheOpConversionPattern,
           ScatterOpConversionPattern,
           PermuteOpConversionPattern,
           UpsampleOpConversionPattern
           >(typeConverter, ctx);

More information on rewrite patterns and their capabilities can be found in the MLIR documentation here and here.

For matmul, we defined a new conversion pattern; the same structure applies to any binary op with arguments named a and b:

namespace {
class MatmulOpConversionPattern : public OpConversionPattern<ttir::MatmulOp> {
public:
  using OpConversionPattern<ttir::MatmulOp>::OpConversionPattern;

  LogicalResult
  matchAndRewrite(ttir::MatmulOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    rewriter.replaceOpWithNewOp<ttnn::MatmulOp>(
        op, this->getTypeConverter()->convertType(op.getType()), adaptor.getA(),
        adaptor.getB(), adaptor.getTransposeA(), adaptor.getTransposeB(),
        nullptr);
    return success();
  }
};
} // namespace

It is invoked as part of the rewrite set shown above:

MatmulOpConversionPattern

TTNN to EmitC

Similarly, we also need to add a pattern to convert from the TTNN dialect to the EmitC dialect.

The method that populates the rewrite patterns can be found in lib/Conversion/TTNNToEmitC/TTNNToEmitC.cpp:

void populateTTNNToEmitCPatterns(mlir::MLIRContext *ctx,
                                 mlir::RewritePatternSet &patterns,
                                 TypeConverter &typeConverter) {
  // Device ops
  //
  patterns.add<TTDeviceOpConversionPattern>(typeConverter, ctx);
  patterns.add<GetDeviceOpConversionPattern>(typeConverter, ctx);

  // Memory ops
  //
  // clang-format off
  patterns.add<ToLayoutOpConversionPattern,
               ToMemoryConfigOpConversionPattern,
               ToDTypeOpConversionPattern,
               TypecastOpConversionPattern,
               ToDeviceOpConversionPattern,
               FromDeviceOpConversionPattern,
               DeallocateOpConversionPattern>(typeConverter, ctx);
  // clang-format on

  // Tensor ops
  //
  // clang-format off
  patterns.add<EmptyOpConversionPattern,
               NamedFullOpConversionPattern<tt::ttnn::ZerosOp>,
               NamedFullOpConversionPattern<tt::ttnn::OnesOp>,
               FullOpConversionPattern,
               DefaultOpConversionPattern<tt::ttnn::ArangeOp>,
               DefaultOpConversionPattern<tt::ttnn::ConstantOp>>(typeConverter, ctx);
  // clang-format on

  // Eltwise unary ops
  //
  patterns
      .add<EltwiseUnaryOpConversionPattern<tt::ttnn::AbsOp>,
           EltwiseUnaryCompositeOpConversionPattern<tt::ttnn::CbrtOp>,
           ClampOpConversionPattern<tt::ttnn::ClampScalarOp>,
           ClampOpConversionPattern<tt::ttnn::ClampTensorOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::FloorOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::IsFiniteOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::LogicalNotOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::BitwiseNotOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::NegOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::ReluOp>,
           ElementwiseUnaryWithFloatParameterOpConversionPattern<
               tt::ttnn::LeakyReluOp>,
           EltwiseUnaryWithFastAndApproximateModeOpConversionPattern<
               tt::ttnn::GeluOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::SqrtOp>,
           EltwiseUnaryWithFastAndApproximateModeOpConversionPattern<
               tt::ttnn::RsqrtOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::SignOp>,
           EltwiseUnaryWithVectorAndFastAndApproximateModeOpConversionPattern<
               tt::ttnn::SigmoidOp>,
           EltwiseUnaryCompositeOpConversionPattern<tt::ttnn::Log1pOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::ReciprocalOp>,
           EltwiseUnaryWithFastAndApproximateModeOpConversionPattern<
               tt::ttnn::ExpOp>,
           EltwiseUnaryWithFastAndApproximateModeOpConversionPattern<
               tt::ttnn::ErfOp>,
           EltwiseUnaryWithFastAndApproximateModeOpConversionPattern<
               tt::ttnn::ErfcOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::CeilOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::SinOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::CosOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::Expm1Op>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::TanOp>,
           EltwiseUnaryWithAccuracyModeOpConversionPattern<tt::ttnn::TanhOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::AtanOp>,
           EltwiseUnaryOpConversionPattern<tt::ttnn::LogOp>>(typeConverter,
                                                             ctx);

  // Eltwise binary ops
  //
  patterns
      .add<EltwiseBinaryOpConversionPattern<tt::ttnn::AddOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::SubtractOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::MultiplyOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::LogicalAndOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::LogicalOrOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::LogicalXorOp>,
           EltwiseBinaryCompositeOpConversionPattern<tt::ttnn::BitwiseAndOp>,
           EltwiseBinaryCompositeOpConversionPattern<tt::ttnn::BitwiseOrOp>,
           EltwiseBinaryCompositeOpConversionPattern<tt::ttnn::BitwiseXorOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::EqualOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::NotEqualOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::GreaterEqualOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::GreaterThanOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::LessEqualOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::LessThanOp>,
           EltwiseBinaryNGCompositeOpConversionPattern<tt::ttnn::MaximumOp>,
           EltwiseBinaryNGCompositeOpConversionPattern<tt::ttnn::MinimumOp>,
           EltwiseBinaryOpConversionPattern<tt::ttnn::DivideOp>,
           EltwiseBinaryCompositeOpConversionPattern<tt::ttnn::ScatterOp>,
           EltwiseBinaryCompositeOpConversionPattern<tt::ttnn::RemainderOp>,
           EltwiseBinaryNGCompositeOpConversionPattern<tt::ttnn::PowOp>,
           EltwiseBinaryCompositeOpConversionPattern<tt::ttnn::Atan2Op>>(
          typeConverter, ctx);

  // Eltwise ternary ops
  //
  patterns.add<EltwiseTernaryOpConversionPattern<tt::ttnn::WhereOp>>(
      typeConverter, ctx);

  // Tensor manipulation ops
  //
  patterns.add<TransposeOpConversionPattern, ConcatOpConversionPattern,
               ReshapeOpConversionPattern, RepeatOpConversionPattern,
               RepeatInterleaveOpConversionPattern, SliceOpConversionPattern,
               PermuteOpConversionPattern,
               DefaultOpConversionPattern<tt::ttnn::PadOp>>(typeConverter, ctx);

  // Quantization ops.
  //
  patterns.add<QuantizationOpConversionPattern<tt::ttnn::QuantizeOp>,
               QuantizationOpConversionPattern<tt::ttnn::DequantizeOp>,
               RequantizeOpConversionPattern>(typeConverter, ctx);

  // Matmul ops
  //
  patterns.add<LinearOpConversionPattern, MatmulOpConversionPattern>(
      typeConverter, ctx);

  // Reduction ops
  //
  patterns.add<ReductionOpConversionPattern<tt::ttnn::SumOp>,
               ReductionOpConversionPattern<tt::ttnn::MeanOp>,
               ReductionOpConversionPattern<tt::ttnn::MaxOp>,
               ReductionOpConversionPattern<tt::ttnn::MinOp>,
               ProdOpConversionPattern, ArgMaxOpConversionPattern>(
      typeConverter, ctx);

  // Pooling ops
  //
  patterns.add<Pooling2dOpConversionPattern<tt::ttnn::AvgPool2dOp>,
               Pooling2dOpConversionPattern<tt::ttnn::MaxPool2dOp>>(
      typeConverter, ctx);
  patterns.add<UpsampleOpConversionPattern>(typeConverter, ctx);

  // Convolution ops
  //
  patterns.add<PrepareConv2dWeightsOpConversionPattern>(typeConverter, ctx);
  patterns.add<Conv2dOpConversionPattern>(typeConverter, ctx);
  patterns.add<ConvTranspose2dOpConversionPattern>(typeConverter, ctx);

  // Other ops
  //
  patterns.add<SoftmaxOpConversionPattern, EmbeddingOpConversionPattern,
               DefaultOpConversionPattern<tt::ttnn::EmbeddingBackwardOp>,
               MorehCumSumOpConversionPattern>(typeConverter, ctx);

  // CCL ops
  //
  patterns.add<AllGatherOpConversionPattern>(typeConverter, ctx);
  patterns.add<ReduceScatterOpConversionPattern>(typeConverter, ctx);
  patterns.add<CollectivePermuteOpConversionPattern>(typeConverter, ctx);
  patterns.add<MeshShardOpConversionPattern>(typeConverter, ctx);

  // KV Cache ops
  //
  patterns.add<DefaultOpConversionPattern<tt::ttnn::UpdateCacheOp>>(
      typeConverter, ctx);
  patterns.add<DefaultOpConversionPattern<tt::ttnn::FillCacheOp>>(typeConverter,
                                                                  ctx);

  // Arith ops
  //
  patterns.add<ArithConstantOpConversionPattern>(typeConverter, ctx);

  // Tuple ops
  //
  patterns.add<GetTupleElementOpConversionPattern>(typeConverter, ctx);
  patterns.add<TupleOpConversionPattern>(typeConverter, ctx);

  // LoadCached op
  //
  patterns.add<LoadCachedOpConversionPattern>(typeConverter, ctx);

  // Module op
  //
  patterns.add<ModuleOpConversionPattern>(typeConverter, ctx);

  // BatchNorm op
  //
  patterns.add<BatchNormOpConversionPattern>(typeConverter, ctx);
}

Writing conversion patterns to EmitC is a little tricky at first. In the general case, we will be converting an op that has operands (SSAs) and attributes (e.g. data type) as arguments. We want to flatten these arguments at the call site.

We'll use EmitC's CallOpaqueOp as the target op. Let's take a look at our matmul IR within the TTNN dialect:

"ttnn.matmul"(%2, %4, %5) : (tensor<64x128xbf16, #ttnn_layout4>, tensor<128x96xbf16, #ttnn_layout6>, tensor<64x96xbf16, #ttnn_layout7>) -> tensor<64x96xbf16, #ttnn_layout7>

Now let's look at matmul's call signature in TTNN lib:

    static Tensor invoke(
        const Tensor& input_tensor_a,
        const Tensor& input_tensor_b,
        const bool transpose_a = false,
        const bool transpose_b = false,
        const std::optional<const MemoryConfig>& memory_config = std::nullopt,
        const std::optional<const DataType> dtype = std::nullopt,
        const std::optional<const MatmulProgramConfig>& program_config = std::nullopt,
        const std::optional<const std::string>& activation = std::nullopt,
        const std::optional<const DeviceComputeKernelConfig> compute_kernel_config = std::nullopt,
        const std::optional<const CoreGrid> core_grid = std::nullopt,
        const std::optional<const tt::tt_metal::Tile>& output_tile = std::nullopt,
        std::optional<Tensor> optional_output_tensor = std::nullopt,
        const std::optional<const DeviceGlobalCircularBuffer>& global_cb = std::nullopt);

If we look closely, we'll notice that the IR has far fewer arguments than the actual signature of the op. As we're lowering to EmitC, which gets translated into actual C++ code, we need to correct for this (ideally the op would be perfectly modelled with all the arguments, but that is not the case today).

We do this by filling in the gaps. EmitC's CallOpaqueOp takes in an array of attributes, and an array of operands, which need to be combined. The combining is done by extending the array of attributes with "pointers" into operands, like so:

    llvm::SmallVector<mlir::Attribute> args{
        emitter.emit(srcOp.getA()),
        emitter.emit(srcOp.getB()),
        emitter.emit(srcOp.getTransposeA()),
        emitter.emit(srcOp.getTransposeB()),
        emitter.emit(std::nullopt) | emitter.getMemoryConfig(srcOp.getResult()),
    };

Pointers are denoted with IndexTypes wrapped into IntegerAttrs. Attributes are converted into EmitC's OpaqueAttr, which can, for practical purposes, be treated as strings: a BoolAttr carrying false as its value needs to be converted into an OpaqueAttr whose value is the string "false", which is what the convertBoolAttr function does.
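
As a rough illustration of what such a conversion amounts to, here is a minimal sketch assuming the standard mlir::emitc::OpaqueAttr API; the actual convertBoolAttr helper in the TTNNToEmitC sources may be shaped differently.

#include "mlir/Dialect/EmitC/IR/EmitC.h"
#include "mlir/IR/Builders.h"

// Sketch: turn a BoolAttr into an EmitC OpaqueAttr that prints as "true" or
// "false" in the generated C++ source.
static mlir::Attribute convertBoolAttr(mlir::Builder &builder,
                                       mlir::BoolAttr attr) {
  return mlir::emitc::OpaqueAttr::get(builder.getContext(),
                                      attr.getValue() ? "true" : "false");
}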

This is our final converted EmitC CallOpaqueOp:

emitc.call_opaque "ttnn::matmul"(%3, %6, %9) {args = [0 : index, 1 : index, #emitc.opaque<"false">, #emitc.opaque<"false">, #emitc.opaque<"std::nullopt">, #emitc.opaque<"std::nullopt">, #emitc.opaque<"std::nullopt">, #emitc.opaque<"std::nullopt">, #emitc.opaque<"std::nullopt">, #emitc.opaque<"std::nullopt">, #emitc.opaque<"std::nullopt">, 2 : index]} : (!emitc.opaque<"ttnn::Tensor">, !emitc.opaque<"ttnn::Tensor">, !emitc.opaque<"ttnn::Tensor">) -> !emitc.opaque<"ttnn::Tensor">

which, when translated to C++ code, looks like:

ttnn::matmul(v6, v9, false, false, std::nullopt, std::nullopt, std::nullopt, std::nullopt, std::nullopt, std::nullopt, std::nullopt, v12);

Full conversion pattern for matmul op:

namespace {
class MatmulOpConversionPattern
    : public TTNNToEmitCBaseOpConversionPattern<tt::ttnn::MatmulOp> {

public:
  using TTNNToEmitCBaseOpConversionPattern<
      tt::ttnn::MatmulOp>::TTNNToEmitCBaseOpConversionPattern;

  LogicalResult
  matchAndRewrite(tt::ttnn::MatmulOp srcOp, tt::ttnn::MatmulOp::Adaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {

    ttnn_to_emitc::EmitCTTNNEmitter<tt::ttnn::MatmulOp> emitter(srcOp, adaptor,
                                                                rewriter);

    llvm::SmallVector<mlir::Attribute> args{
        emitter.emit(srcOp.getA()),
        emitter.emit(srcOp.getB()),
        emitter.emit(srcOp.getTransposeA()),
        emitter.emit(srcOp.getTransposeB()),
        emitter.emit(std::nullopt) | emitter.getMemoryConfig(srcOp.getResult()),
    };

    emitter.replaceOp(*this, args);

    return success();
  }
};
} // namespace

4. Add a compiler unit test for the Op

So far we have defined the Op in the TTIR and TTNN dialects, implemented verifiers, and have conversion passes. Now we need to add a unit test to ensure that the pass is working correctly. The compiler unit tests are located in test/ttmlir/Dialect area. In this case we'll add a test under the TTNN subdirectory since we are testing the ConvertTTIRToTTNNPass.

test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir

// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline %s | FileCheck %s
module {
  func.func @forward(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
    %0 = ttir.empty() : tensor<64x96xbf16>
    // CHECK: "ttnn.matmul"
    %1 = "ttir.matmul"(%arg0, %arg1, %0) : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
    return %1 : tensor<64x96xbf16>
  }
}

Unit tests in MLIR are typically written using a tool called FileCheck; please refer to the LLVM FileCheck documentation for a tutorial and more information about the RUN and CHECK directives.

A few things to point out specifically regarding tt-mlir dialects:

  • tt.system_desc: This is a 1-1 mapping to the SystemDesc flatbuffer schema that is used to describe the system configuration. This is a required attribute tagged on the top level module for all tt-mlir dialects.
  • Pass --ttnn-layout is a prerequisite before running convert-ttir-to-ttnn. This pass is responsible for converting the input tensors to device memory space and tile layout before lowering to TTNN.
  • This test is asserting that ttir.matmul converts to ttnn.matmul.

To run the test, you can use the following command:

cmake --build build -- check-ttmlir

You can also manually run ttmlir-opt on the test file to see the resulting output:

./build/bin/ttmlir-opt --tt-register-device="system-desc-path=<PATH_TO_SYSTEM_DESC>" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir

5. Define flatbuffer schema for the Op

Next we will define the flatbuffer schema for the Op. The schema must capture all tensor inputs, outputs, and attributes of the Op, i.e. everything the runtime needs to execute the Op.

The schema can be placed in an existing .fbs file located in the include/ttmlir/Target/TTNN/operations directory.

If no suitable .fbs file exists for the operation category, feel free to create new .fbs files as needed. After creating a new .fbs file, remember to add a corresponding cmake target in the include/ttmlir/Target/TTNN/CMakeLists.txt file.

include/ttmlir/Target/TTNN/CMakeLists.txt

  operations/matmul.fbs

In our case, we can add our schema to include/ttmlir/Target/TTNN/operations/matmul.fbs directly, without needing to create a new file.

include/ttmlir/Target/TTNN/operations/matmul.fbs

table MatmulOp {
  a: tt.target.ttnn.TensorRef;
  b: tt.target.ttnn.TensorRef;
  out: tt.target.ttnn.TensorRef;
  transpose_a: bool;
  transpose_b: bool;
  matmul_program_config: tt.target.ttnn.MatmulProgramConfig;
}

Regarding the type TensorRef: flatbuffer tables with the Ref suffix are used to represent live values during the runtime, decoupled from the underlying Desc-suffixed tables, which carry the type and attribute information for the object.

After creating the schema for our new operation type, we need to register it in the OpType union within program.fbs. This file serves as the main entry point for all program information, where the OpType union collects and defines all supported operation types and their corresponding schemas.

include/ttmlir/Target/TTNN/program.fbs

  MatmulOp,

If a new .fbs file was created, don't forget to include the new file in include/ttmlir/Target/TTNN/program.fbs.

include "ttmlir/Target/TTNN/operations/matmul.fbs";

More information about writing flatbuffer schemas can be found in the flatbuffers documentation.

6. Serialize the Op in the flatbuffer format

In the previous section we defined the flatbuffer schema for the matmul Op, now let's put our new schema definition to use. The schema is used as input to a program called flatc which generates C++ code (or any language for that matter) for serializing and deserializing the schema. This generated code can be found in build/include/ttmlir/Target/TTNN/program_generated.h.

Let's head over to lib/Target/TTNN/TTNNToFlatbuffer.cpp to define a createOp overloaded function that does the conversion from MLIR to flatbuffer:

::flatbuffers::Offset<::tt::target::ttnn::MatmulOp>
createOp(FlatbufferObjectCache &cache, MatmulOp op) {
  auto a = cache.at<::tt::target::ttnn::TensorRef>(
      getOperandThroughDPSOps(op.getA()));
  auto b = cache.at<::tt::target::ttnn::TensorRef>(
      getOperandThroughDPSOps(op.getB()));
  auto output = cache.getOrCreate(op.getResult(), tensorValueToFlatbuffer,
                                  kHostAllocatedSize);

  using MatmulConfigType = ::tt::target::ttnn::MatmulProgramConfig;
  MatmulConfigType matmulProgramConfigType = MatmulConfigType::NONE;
  ::flatbuffers::Offset<void> matmulProgramConfigDesc;
  if (auto matmulProgramConfig = op.getMatmulProgramConfigAttr()) {
    if (auto config =
            mlir::dyn_cast<ttnn::MatmulMultiCoreReuseProgramConfigAttr>(
                matmulProgramConfig)) {
      matmulProgramConfigType =
          MatmulConfigType::MatmulMultiCoreReuseProgramConfig;
      matmulProgramConfigDesc = toFlatbuffer(cache, config).Union();
    } else if (auto config = mlir::dyn_cast<
                   ttnn::MatmulMultiCoreReuseMultiCastProgramConfigAttr>(
                   matmulProgramConfig)) {
      matmulProgramConfigType =
          MatmulConfigType::MatmulMultiCoreReuseMultiCastProgramConfig;
      matmulProgramConfigDesc = toFlatbuffer(cache, config).Union();
    } else if (auto config = mlir::dyn_cast<
                   ttnn::MatmulMultiCoreReuseMultiCast1DProgramConfigAttr>(
                   matmulProgramConfig)) {
      matmulProgramConfigType =
          MatmulConfigType::MatmulMultiCoreReuseMultiCast1DProgramConfig;
      matmulProgramConfigDesc = toFlatbuffer(cache, config).Union();
    } else if (
        auto config = mlir::dyn_cast<
            ttnn::MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfigAttr>(
            matmulProgramConfig)) {
      matmulProgramConfigType = MatmulConfigType::
          MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig;
      matmulProgramConfigDesc = toFlatbuffer(cache, config).Union();
    }
  }

  return ::tt::target::ttnn::CreateMatmulOp(
      *cache.fbb, a, b, output, op.getTransposeA(), op.getTransposeB(),
      matmulProgramConfigType, matmulProgramConfigDesc);
}

Lots of things are happening here; let's break it down:

  • FlatbufferObjectCache: This is a helper class that is used to cache objects in the flatbuffer that are created during the serialization process. This is necessary for managing value lifetimes and identifiers; at the same time, it is an optimization to avoid having multiple copies of the same object. For example, a TensorRef with multiple uses could naively be recreated, one for each use, but with the cache we can ensure that the object is only created once and all uses point to the same flatbuffer offset. The cache is passed around to all serialization functions and should be used whenever creating a new object.
  • getOperandThroughDPSOps: In section 1. we discussed DPS semantics and the drawback of having the result alias the output tensor. This is one of those cases where we need to use a helper function to trace through the output operands to find the original SSA name in order to associate it with the original TensorRef.
  • CreateMatmulOp: The autogenerated function from the flatbuffer schema that actually serializes the data into the flatbuffer format.

We can finally generate a binary with our new Op! We can use the following command:

./build/bin/ttmlir-opt --tt-register-device="system-desc-path=<PATH_TO_SYSTEM_DESC>" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir | ./build/bin/ttmlir-translate --ttnn-to-flatbuffer -o out.ttnn

And we can inspect the binary with ttrt:

ttrt read out.ttnn

Note: If the above ttrt command yields a segfault, a clean build of your workspace may be required: Build Instructions

7. Add runtime support for the Op

Next, we want to add runtime support for the Op by parsing the flatbuffer and invoking the TTNN API.

runtime/lib/ttnn/operations/matmul/matmul.cpp

void run(const ::tt::target::ttnn::MatmulOp *op, ProgramContext &context) {
  ProgramTensorPool &tensorPool = context.getTensorPool();
  const ::ttnn::Tensor &lhs = tensorPool.getTTNNTensorAndValidate(op->a());
  const ::ttnn::Tensor &rhs = tensorPool.getTTNNTensorAndValidate(op->b());

  auto outputMemoryConfig =
      ::tt::runtime::ttnn::utils::createMemoryConfigIfNeeded(
          ::tt::runtime::ttnn::utils::getTensorRefMemoryConfig(op->out()));
  LOG_ASSERT(::tt::runtime::ttnn::utils::inSystemMemory(op->out()) ||
                 outputMemoryConfig,
             "Memory config must exist for device tensors");

  ::ttnn::DataType outputDataType = utils::getDataType(op->out());

  std::optional<::ttnn::operations::matmul::MatmulProgramConfig>
      matmulProgramConfig = utils::createMatmulProgramConfigIfNeeded(op);

  ::ttnn::Tensor output = ::ttnn::matmul(
      lhs, rhs, op->transpose_a(), op->transpose_b(), outputMemoryConfig,
      outputDataType, matmulProgramConfig,
      /*activation=*/std::nullopt, /*compute_kernel_config=*/std::nullopt,
      /*core_grid=*/std::nullopt, /*output_tile=*/std::nullopt,
      /* optional_output_tensor=*/std::nullopt);

  tensorPool.insertTTNNTensorAndValidate(op->out(), output);
}

A couple things to note from above:

  • Most runtime op functions will follow a similar pattern: they take in some additional data structures for managing the program context.
    • Program context tracks the state of the current program. It stores intermediate tensors and devices.
  • tensorPool.getTTNNTensorAndValidate(op->a()): the tensor pool looks up live tensors by their global_id, a unique identifier that was generated and managed by the FlatbufferObjectCache. This is how it's intended to be used by the runtime.
  • Some operations may belong to a larger set of operations. For example, any eltwise unary operations can be added in runtime/lib/ttnn/operations/eltwise/unary.cpp directly without needing to create a new file.

If a new file is created for the op, we need to add a new source to runtime/lib/ttnn/operations/CMakeLists.txt and a new case to runtime/lib/ttnn/program_executor.cpp.

To update runtime/lib/ttnn/operations/CMakeLists.txt, include the path to the source file in TTNN_OPS_SRCS:

runtime/lib/ttnn/operations/CMakeLists.txt

  ${CMAKE_CURRENT_SOURCE_DIR}/matmul/matmul.cpp

To update runtime/lib/ttnn/program_executor.cpp, add a new case to the runOperation method of ProgramExecutor:

runtime/lib/ttnn/program_executor.cpp

  case ::tt::target::ttnn::OpType::MatmulOp: {
    return operations::matmul::run(op->type_as_MatmulOp(), getContext());
  }

We can test our changes with ttrt (don't forget to rebuild ttrt):

ttrt run out.ttnn

8. Add a silicon unit test for the Op

After adding runtime support, we're ready to test our Op on silicon. All silicon tests are located under test/ttmlir/Silicon. The process is similar to adding a compiler unit test.

In our specific case, we create a unit test here:

test/ttmlir/Silicon/TTNN/matmul/simple_matmul.mlir

// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%" %s > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn
module {
  func.func @forward(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
    %0 = ttir.empty() : tensor<64x96xbf16>
    // CHECK: "ttnn.matmul"
    %1 = "ttir.matmul"(%arg0, %arg1, %0) : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
    return %1 : tensor<64x96xbf16>
  }

  func.func @matmul_transpose_lhs(%arg0: tensor<64x128xbf16>, %arg1: tensor<64x128xbf16>) -> tensor<128x128xbf16> {
    %0 = ttir.empty() : tensor<128x128xbf16>
    // CHECK: "ttnn.matmul"
    %1 = "ttir.matmul"(%arg0, %arg1, %0) <{transpose_a = true}>: (tensor<64x128xbf16>, tensor<64x128xbf16>, tensor<128x128xbf16>) -> tensor<128x128xbf16>
    return %1 : tensor<128x128xbf16>
  }

  func.func @matmul_transpose_rhs(%arg0: tensor<64x128xbf16>, %arg1: tensor<64x128xbf16>) -> tensor<64x64xbf16> {
    %0 = ttir.empty() : tensor<64x64xbf16>
    // CHECK: "ttnn.matmul"
    %1 = "ttir.matmul"(%arg0, %arg1, %0) <{transpose_b = true}>: (tensor<64x128xbf16>, tensor<64x128xbf16>, tensor<64x64xbf16>) -> tensor<64x64xbf16>
    return %1 : tensor<64x64xbf16>
  }
}

A couple of things to point out about this process:

  • Tests placed under test/ttmlir/Dialect will only test the compiler's capability of compiling the module. If you want the module to run on silicon in CI, the test must be placed under test/ttmlir/Silicon.
  • Notice the differences between the compilation headers of test/ttmlir/Silicon/TTNN/matmul/simple_matmul.mlir and test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir
    • --ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%": The system-desc-path option specifies the location of the system descriptor required for compiling the module. This is crucial for silicon tests, as modules compiled with different system descriptors may vary in silicon compatibility. Ensuring the system descriptor accurately reflects the target hardware is essential for running the module correctly.
    • // RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn: This runs ttmlir-translate that serializes the output mlir module to a flatbuffer binary. We added the logic for this serialization in the Serialize the Op in the flatbuffer format section.

9. Add an EmitC test for the Op

The op should be tested in the EmitC (C++ codegen) path as well.

TTNN EmitC tests live in the test/ttmlir/EmitC/TTNN path. In our case, the test is in test/ttmlir/EmitC/TTNN/matmul/matmul.mlir.

test/ttmlir/EmitC/TTNN/matmul/matmul.mlir

// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%" %s > %t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %basename_t.ttnn
// RUN: ttmlir-opt --ttnn-tuplify-tensors --convert-ttnn-to-emitc %t.mlir > %t2.mlir
// RUN: ttmlir-translate --mlir-to-cpp %t2.mlir > %basename_t.cpp

func.func @matmul(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
  %0 = ttir.empty() : tensor<64x96xbf16>
  %1 = "ttir.matmul"(%arg0, %arg1, %0) : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
  return %1 : tensor<64x96xbf16>
}

The first two RUN lines create a flatbuffer. The third and fourth convert to the EmitC dialect and translate to C++, writing the result to a matmul.mlir.cpp file.

Additionally, the op's header file operations/matmul/matmul.hpp should be added to the list of includes in tools/ttnn-standalone/ttnn-precompiled.hpp:

#include "operations/ccl/all_gather/all_gather.hpp"
#include "operations/ccl/ccl_host_types.hpp"
#include "operations/ccl/reduce_scatter/reduce_scatter.hpp"
#include "operations/conv/conv2d/conv2d.hpp"
#include "operations/conv/conv2d/prepare_conv2d_weights.hpp"
#include "operations/conv/conv_transpose2d/conv_transpose2d.hpp"
#include "operations/core/core.hpp"
#include "operations/creation.hpp"
#include "operations/data_movement/concat/concat.hpp"
#include "operations/data_movement/permute/permute.hpp"
#include "operations/data_movement/repeat/repeat.hpp"
#include "operations/data_movement/repeat_interleave/repeat_interleave.hpp"
#include "operations/data_movement/slice/slice.hpp"
#include "operations/data_movement/transpose/transpose.hpp"
#include "operations/eltwise/binary/binary.hpp"
#include "operations/eltwise/binary/binary_composite.hpp"
#include "operations/eltwise/quantization/quantization.hpp"
#include "operations/eltwise/unary/unary_composite.hpp"
#include "operations/embedding/embedding.hpp"
#include "operations/embedding_backward/embedding_backward.hpp"
#include "operations/matmul/matmul.hpp"
#include "operations/moreh/moreh_cumsum/moreh_cumsum.hpp"
#include "operations/normalization/batch_norm/batch_norm.hpp"
#include "operations/normalization/softmax/softmax.hpp"
#include "operations/pool/generic/generic_pools.hpp"
#include "operations/pool/upsample/upsample.hpp"
#include "operations/reduction/argmax/argmax.hpp"
#include "operations/reduction/generic/generic_reductions.hpp"
#include "operations/reduction/prod/prod.hpp"
#include "tt-metalium/bfloat16.hpp"
#include "tt-metalium/small_vector.hpp"
#include "ttnn/core.hpp"
#include "ttnn/device.hpp"
#include "ttnn/operations/copy/typecast/typecast.hpp"
#include "ttnn/tensor/tensor.hpp"
#include "ttnn/tensor/types.hpp"
#include "ttnn/types.hpp"
#include "workarounds.hpp"

Decomposing an Op in TTIR

This guide explains how to add and decompose a new operation in the TTIR dialect. We'll focus on adding an Index operation, which will be decomposed into the Slice operation. The decomposition is implemented as a conversion pass in MLIR, since that allows us to mark operations or dialects as legal or illegal and to perform type conversion.

This guide will cover the following steps:

1. Define the Op in the TTIR frontend dialect

More information regarding this step can be found here: Define the Op in the TTIR frontend dialect

I updated TTIROps.td as follows:

def TTIR_IndexOp: TTIR_NamedOp<"index"> {
    let summary = "Tensor indexing operation.";
    let description = [{
      The `index` operation extracts a sub-tensor (slice) from the input tensor along a specified dimension.

      This operation selects elements from the input tensor along a single dimension based on the specified
      begin, end, and step indices. It's similar to Python's slicing notation `tensor[:, begin:end:step, :]`
      where the slicing is applied only to the specified dimension.

      Example:
      ```mlir
      // Extract elements with indices 1, 3, 5 from dimension 0 of a 1D tensor
      %input = ... : tensor<6xf32>  // Input tensor with values: [1, 2, 3, 4, 5, 6]
      %output = ttir.empty() : tensor<3xf32>  // Output tensor shape
      %result = ttir.index(%input, %output) {
          dim = 0 : i32,    // Dimension to index
          begin = 1 : i32,  // Start index
          end = 6 : i32,    // End index (exclusive)
          step = 2 : i32    // Step size
      } : tensor<6xf32>, tensor<3xf32> -> tensor<3xf32>
      // Result: [2, 4, 6]

      // Extract columns 0 and 2 from a 2D tensor
      %input = ... : tensor<3x4xf32>  // Input tensor with values:
                                      // [[1, 2, 3, 4],
                                      //  [5, 6, 7, 8],
                                      //  [9, 10, 11, 12]]
      %output = ttir.empty() : tensor<3x2xf32>  // Output tensor shape
      %result = ttir.index(%input, %output) {
          dim = 1 : i32,    // Index along columns (dimension 1)
          begin = 0 : i32,  // Start from first column
          end = 3 : i32,    // End at third column (exclusive)
          step = 2 : i32    // Take every other column
      } : tensor<3x4xf32>, tensor<3x2xf32> -> tensor<3x2xf32>
      // Result:
      // [[1, 3],
      //  [5, 7],
      //  [9, 11]]
      ```

      Inputs:
      - `input` (Tensor): The input tensor to index.

      Attributes:
      - `dim` (Integer): The dimension along which to index.
      - `begin` (Integer): The starting index.
      - `end` (Integer): The ending index (exclusive).
      - `step` (Integer): The step size between indices.

      Outputs:
      - `result` (Tensor): The indexed tensor.

      Note: The shape of the output tensor is the same as the input tensor except for the indexed dimension,
      which will have size `ceil((end - begin) / step)`. The indices selected will be `begin`, `begin + step`,
      `begin + 2*step`, etc., up to but not including `end`.
    }];

    let arguments = (ins AnyRankedTensor:$input,
                         AnyRankedTensor:$output,
                         I32Attr:$dim,
                         I32Attr:$begin,
                         I32Attr:$end,
                         I32Attr:$step);

    let results = (outs AnyRankedTensor:$result);

    let hasVerifier = 1;
}

The verification function has been added as well:

// IndexOp verification
::mlir::LogicalResult mlir::tt::ttir::IndexOp::verify() {
  ::mlir::RankedTensorType inputType = getInput().getType();
  ::llvm::ArrayRef<int64_t> inputShape = inputType.getShape();
  ::mlir::RankedTensorType outputType = getOutput().getType();
  int32_t dim = getDim();
  int32_t begin = getBegin();
  int32_t end = getEnd();
  int32_t step = getStep();

  // Verify that the input is at least 1D tensor
  if (inputType.getRank() < 1) {
    return emitOpError("Input must be at least a 1D tensor");
  }

  // Validate that the output tensor has the same element type as the input
  // tensor
  if (inputType.getElementType() != outputType.getElementType()) {
    return emitOpError(
        "Output tensor must have the same element type as the input tensor");
  }

  // Verify the output tensor rank
  if (inputType.getRank() != outputType.getRank()) {
    return emitOpError(
        "Output tensor must have the same rank as the input tensor");
  }

  // Verify that the dim attribute is within the bounds of the input tensor
  if (dim < 0 || dim >= inputType.getRank()) {
    return emitOpError() << "Invalid dimension index " << dim
                         << ". Input tensor rank is " << inputType.getRank();
  }

  // Verify begin, end, step and the output tensor dimensions
  int64_t dimSize = inputShape[dim];

  // Adjust negative begin and end
  int32_t adjustedBegin = (begin < 0) ? (begin + dimSize) : begin;
  int32_t adjustedEnd = (end < 0) ? (end + dimSize) : end;

  std::ostringstream inputShapeStream;
  inputShapeStream << "(";
  for (size_t i = 0; i < inputShape.size(); ++i) {
    inputShapeStream << inputShape[i];
    if (i != inputShape.size() - 1) {
      inputShapeStream << ", ";
    }
  }
  inputShapeStream << ")";
  std::string inputShapeStr = inputShapeStream.str();

  if (adjustedBegin < 0 || adjustedBegin >= dimSize) {
    return emitOpError() << "Invalid begin index for dimension "
                         << std::to_string(dim) << ". Expected value in range ["
                         << std::to_string(-dimSize) << ", " << dimSize
                         << "), got " << begin
                         << ". Input shape: " << inputShapeStr;
  }
  if (adjustedEnd < 0 || adjustedEnd > dimSize) {
    return emitOpError() << "Invalid end index for dimension "
                         << std::to_string(dim) << ". Expected value in range ["
                         << std::to_string(-dimSize) << ", " << dimSize
                         << "], got " << end
                         << ". Input shape: " << inputShapeStr;
  }

  auto formatValueMessage = [](int value, int adjustedValue) {
    return value < 0 ? std::to_string(adjustedValue) + " (" +
                           std::to_string(value) + ")"
                     : std::to_string(value);
  };
  std::string beginValueMessage = formatValueMessage(begin, adjustedBegin);
  std::string endValueMessage = formatValueMessage(end, adjustedEnd);

  if (step == 0) {
    return emitOpError("Step value for dimension " + std::to_string(dim) +
                       " cannot be zero");
  }

  if (step > 0 && adjustedBegin > adjustedEnd) {
    return emitOpError() << "For positive step, begin index must be less "
                            "than or equal to end index for dimension "
                         << dim << ". Got begin: " << beginValueMessage
                         << ", end: " << endValueMessage << ", step: " << step
                         << ", input shape: " << inputShapeStr;
  }

  if (step < 0 && adjustedBegin < adjustedEnd) {
    return emitOpError() << "For negative step, begin index must be greater "
                            "than or equal to end index for dimension "
                         << dim << ". Got begin: " << beginValueMessage
                         << ", end: " << endValueMessage << ", step: " << step
                         << ", input shape: " << inputShapeStr;
  }

  // Calculate the expected size of the output dimension
  int32_t expectedDimSize =
      (std::abs(adjustedEnd - adjustedBegin) + std::abs(step) - 1) /
      std::abs(step);
  if (outputType.getDimSize(dim) != expectedDimSize) {
    return emitOpError() << "Mismatch in dimension " << std::to_string(dim)
                         << " of the output tensor: expected size "
                         << expectedDimSize << ", but got "
                         << outputType.getDimSize(dim);
  }

  return success();
}

2. Create a conversion pattern

A conversion pattern defines how MLIR should rewrite the Op. It can be implemented in either C++ or TableGen. Currently, we only have the C++ implementation; TableGen format will be added in the future.

C++ conversion pattern

For the Index operation, we use the C++ conversion pattern because it involves changing the Op's inputs from integer attributes to array attributes, which TableGen lacks the flexibility to express.

// This transformation adjusts IndexOp attributes so that `begin`, `end`, and
// `step` become arrays, where each array element corresponds to a dimension of
// the input tensor. For dimensions other than the sliced dimension, default
// values are used.
//
namespace {
struct IndexToSliceConversionPattern
    : public OpConversionPattern<ttir::IndexOp> {
  using OpConversionPattern<ttir::IndexOp>::OpConversionPattern;

  LogicalResult
  matchAndRewrite(ttir::IndexOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    auto inputType =
        ::mlir::dyn_cast<mlir::RankedTensorType>(adaptor.getInput().getType());
    if (!inputType || !inputType.hasRank()) {
      return failure();
    }

    int64_t rank = inputType.getRank();
    llvm::SmallVector<mlir::Attribute, 4> begins, ends, steps;

    for (int64_t i = 0; i < rank; ++i) {
      if (i == op.getDim()) {
        begins.push_back(rewriter.getI32IntegerAttr(adaptor.getBegin()));
        ends.push_back(rewriter.getI32IntegerAttr(adaptor.getEnd()));
        steps.push_back(rewriter.getI32IntegerAttr(adaptor.getStep()));
      } else {
        begins.push_back(rewriter.getI32IntegerAttr(0));
        ends.push_back(rewriter.getI32IntegerAttr(inputType.getDimSize(i)));
        steps.push_back(rewriter.getI32IntegerAttr(1));
      }
    }

    auto newOp = rewriter.create<ttir::SliceOp>(
        op.getLoc(), op.getType(), adaptor.getInput(), adaptor.getOutput(),
        rewriter.getArrayAttr(begins), rewriter.getArrayAttr(ends),
        rewriter.getArrayAttr(steps));

    rewriter.replaceOp(op, newOp.getResult());
    return success();
  }
};
} // namespace

The matchAndRewrite method from OpConversionPattern is implemented to replace the matched Op with the newly created Op. Since decomposition is implemented as a conversion pass, OpAdaptor is used to access the attributes of the original Op in their converted types. Finally, we instantiate the new Op and call the replaceOp method on ConversionPatternRewriter to replace the original Op.

Tablegen conversion pattern

TODO

3. Register the created conversion pattern

To register the new pattern, go to the populateTTIRToTTIRDecompositionPatterns function in TTIRToTTIRDecomposition.cpp and add it to the RewritePatternSet using the add method. After that is done, you should mark the decomposed op as illegal in the runOnOperation method of TTIRToTTIRDecompositionPass in TTIRToTTIRDecompositionPass.cpp.
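
A minimal sketch of that registration, using the pattern and op names from this guide; the exact signatures live in TTIRToTTIRDecomposition.cpp and TTIRToTTIRDecompositionPass.cpp and may differ slightly.

// In TTIRToTTIRDecomposition.cpp: add the new pattern to the set.
void populateTTIRToTTIRDecompositionPatterns(mlir::MLIRContext *ctx,
                                             mlir::RewritePatternSet &patterns,
                                             mlir::TypeConverter &typeConverter) {
  // ... existing patterns ...
  patterns.add<IndexToSliceConversionPattern>(typeConverter, ctx);
}

// In TTIRToTTIRDecompositionPass.cpp, inside runOnOperation(): mark the op as
// illegal so the conversion framework is forced to decompose it.
mlir::ConversionTarget target(getContext());
target.addIllegalOp<ttir::IndexOp>();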

You should also add a silicon test, as described here: Add a silicon unit test for the Op. This is what the silicon test for the Index operation looks like:

// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%" %s > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn
module attributes {} {
  func.func @forward(%arg0: tensor<4x32x32xbf16>) -> tensor<4x32x16xbf16> {
    %0 = ttir.empty() : tensor<4x32x16xbf16>
    // CHECK: = "ttnn.slice"
    %1 = "ttir.index"(%arg0, %0) <{dim = 2: i32, begin = 0: i32, end = 32: i32, step = 2: i32}> : (tensor<4x32x32xbf16>, tensor<4x32x16xbf16>) -> tensor<4x32x16xbf16>
    return %1 : tensor<4x32x16xbf16>
  }
}

Doxygen

An autogenerated Doxygen code reference is available here: Doxygen

Build Instructions

To build the Doxygen reference, use the doxygen target in CMake:

cmake -B build
cmake --build build -- doxygen

Specifications

Specifications are documents that define the requirements for features or concepts that are particularly cross-cutting, complex, or require a high degree of coordination and planning. They are intended to be living documents that evolve as the feature is developed and should be maintained as the go-to reference documentation for the feature or concept.

Specifications are written in markdown and are stored in the docs/src/specs directory of the repository. Below is a template that should be used when creating a new specification.

Specification Template

# [Title]

A brief description of the feature or concept that this specification is
defining.

## Motivation

A description of why this feature or concept is needed and what problem it is
solving. This section is best written by providing concrete examples and use
cases.

## Proposed Changes

A list of the components that will be impacted by this spec and a detailed
description of the changes that will be made to each respective component.

It should also call out any interactions between components and how they might
share an interface or communicate with each other.

## Test Plan

A brief description of how the feature or concept will be tested.

## Concerns

A list of concerns that have been identified during the design of this feature.

Runtime Stitching

Runtime stitching adds the ability for the runtime to stitch together multiple, independently compiled programs at runtime, i.e. without compiler knowledge of how the binary programs will be composed.

Motivation

In order to flexibly support arbitrary training schedules / composing multiple models together we want to have the ability for the runtime to stitch graphs together. To achieve this we need to define an ABI kind of interface between the compiler and the runtime.

Simple Example

mod_a = forge.compile(PyTorch_module_a)
mod_b = forge.compile(PyTorch_module_b)

for i in range(10):
    outs_a = mod_a(ins_a)
    outs_b = mod_b(outs_a)

mod_a and mod_b are 2 independent compile steps; during the compile step for mod_a, the compiler should be completely unaware that mod_b will take place, and vice-versa. In order to achieve this we propose a new runtime concept called stitching:

  • forge invokes the compile step for mod_a; the tt-mlir compiler determines where the inputs (ins_a) should live: host, device DRAM, or device L1. tt-mlir returns metadata to forge describing where it wants the tensors to reside before flatbuffer submission.
  • forge invokes the compile step for mod_b; the same happens as in the first bullet.
  • mod_a is invoked at runtime; the forge runtime needs to inspect the compiler metadata to determine where the tensors should live. The runtime manually invokes a new data copy command to get the tensors to the correct memory space / correct memory address.
  • forge runtime invokes the mod_a program submit.
  • mod_b is invoked at runtime; this time it might be that the compiler left the tensor outputs in L1, so no data copy is needed to start running mod_b since the inputs are already in the correct location.

A more concrete use case would be a training loop where there are often multiple graphs composed together (#82), or when we eventually support torch 2.0, where the torch runtime can arbitrarily break the graph anywhere.

Proposed Changes

Compiler Metadata

The compiler will encode the input tensor layout information directly into the flatbuffer tensor desc. The flatbuffer schema already exists to express this; we just need to start populating it instead of assuming a canonical host layout.

The compiler will decide where the tensors should live: host, device DRAM, or device L1.

Runtime

  • Runtime will inspect the tensor desc metadata to determine where the tensors need to end up / what layout they should be in before invoking the program.
  • New runtime API Tensor toLayout(Tensor tensor, ::tt::target::TensorDesc* tensorDesc);
  • Runtime will need to invoke toLayout on all input tensors before invoking the program (a rough sketch of this flow follows below).
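
The sketch below illustrates the proposed flow in C++. Only toLayout is the API proposed above; the Program, Tensor, getInputDesc, and submit names are hypothetical stand-ins used purely for illustration.

#include <cstddef>
#include <vector>

// Stand-in types for the sketch; the real runtime types live elsewhere.
struct Tensor {};
namespace tt::target { struct TensorDesc {}; }
struct Program {
  tt::target::TensorDesc *getInputDesc(std::size_t i);            // hypothetical
  std::vector<Tensor> submit(const std::vector<Tensor> &inputs);  // hypothetical
};

// Proposed runtime API from this spec.
Tensor toLayout(Tensor tensor, tt::target::TensorDesc *tensorDesc);

// Hypothetical stitching flow: relayout every input according to the compiler
// metadata recorded in the flatbuffer, then submit the program. Outputs of one
// program can feed the next with no extra copies if they already match the next
// program's expected layout.
std::vector<Tensor> runStitched(Program &program, std::vector<Tensor> inputs) {
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    inputs[i] = toLayout(inputs[i], program.getInputDesc(i));
  }
  return program.submit(inputs);
}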

Test Plan

  • Add a new test to the runtime gtest suite that verifies the runtime can correctly stitch together 2 independently compiled programs.

Concerns

  • Tensors that pass through device memory spaces (DRAM, L1) will have a dynamic address; some arbitrary run order of flatbuffers could cause tensors to end up in non-ideal locations in memory. Specifically in L1, a poorly placed tensor might not be able to be moved to a better location without a bounce through DRAM.

Tensor Layout

The tensor layout attribute captures how tensor data is sharded across a grid of devices and cores, and how it is laid out in memory.

Motivation / High level goals

  • Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
  • Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
  • Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
  • Explicit: A single source of truth.
  • Enable a direct way to query padded regions.

An Example / Walkthrough

Let's consider a snippet of MLIR:

tensor<2x3x64x128xf32>

Here we've defined a 4 dimensional tensor using MLIR's builtin tensor type. This tensor type has an optional attribute called an Encoding; this attribute is used by the TT dialect to encode the tensor's layout. This looks like:

tensor<2x3x64x128xf32,
  #tt.metal_layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <1x1>,
    memref<384x128xf32, #tt.memory_space<l1>>
  >
>

At the time of this writing there are 4 properties that make up a tensor layout:

  • linear: An affine map that defines how the logical tensor dimensions map to a grid shape. Note that the number of dims in the affine map must match exactly the rank of the original tensor, and the number of results must match exactly the rank of the grid shape.
  • oob_val: A tracked out of bounds value that fills padding space.
  • grid: The grid shape that this tensor is divided onto.
  • memref: A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid.

This example isn't particularly complicated because it's only sharded to a 1x1 grid. The rest of the document goes into more detail on the following topics: dimension collapsing, multi-core grids, tilization, and padding.

Before we jump into more advanced topics there are two resources that could be useful to have at hand:

  • test/python/tensor_layout.py: Python test with many convenience functions for creating and experimenting with tensor layouts.
  • TTNN Interactive Visualizer: An interactive visualization tool that demonstrates the transformation. Note that this tool was created for the TTNN tensor layout, but many of the same concepts transfer over.

Dimension Collapsing

Probably the most important concept in tt.metal_layout is dimension collapsing. This is captured by the affine map linear property which provides a mapping from tensor dim space to a reduced physical dimensional space. This single-handedly touches on most of the tensor layout goals mentioned at the beginning of the doc:

  • Leaves tensor shapes intact
  • Logical-Physical mapping, how the tensor is laid out in memory over a grid
  • Enables more flexible sharding
  • Explicit padding

To see how these goals are achieved we'll continue working on an explicit example, same one as above:

(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3)

To recap, we have our example 4d tensor (2, 3, 64, 128), which maps directly to the LHS (d0, d1, d2, d3). We have our 2d grid shape (1, 1); notice the affine-map RHS is also 2d, and this describes how tensor dims map to a lower dimensional physical memory, overlaid on a grid. We'll see how this gets divided onto the grid later, but first let's look at how this forms an affine-map iteration space. If we index our tensor at, say, [1, 1, 6, 100], we can simply plug in those numbers to get our remapped offset:

(1 * 192 + 1 * 64 + 6, 100) = (262, 100)

This remapped offset (262, 100) corresponds to the row and column index of the collapsed physical memory.

By default, the dim range [0, -1) is collapsed, but the tt.metal_layout constructor can actually take a programmable range called collapseIntervals. collapseIntervals is a list of pairs, where each pair is a dim range interval, left inclusive, right exclusive. Let's consider a few examples:

Instead of multiplying out real shapes, we will use <> to represent a dimension join operator.

  • 3D tensor onto a 2D grid and default collapseIntervals=[(0, -1)]:
(d0, d1, d2) -> (d0 <> d1, d2)
  • 4D tensor onto a 3D grid and collapseIntervals=[(1, -1)]:
(d0, d1, d2, d3) -> (d0, d1 <> d2, d3)
  • 4D tensor onto a 3D grid and collapseIntervals=[(0, 2)]:
(d0, d1, d2, d3) -> (d0 <> d1, d2, d3)
  • 7D tensor onto a 4D grid and collapseIntervals=[(0, 3), (-3, -1)]:
(d0, d1, d2, d3, d4, d5, d6) -> (d0 <> d1 <> d2, d3, d4 <> d5, d6)

Multi-core

Let's consider the original example again, but on a larger grid than 1x1, say 2x4:

tensor<2x3x64x128xf32,
  #tt.metal_layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <2x4>,
    memref<192x32xf32, #tt.memory_space<l1>>
  >
>

The number of affine map results, grid shape, and memref shape all must have the same rank. We can see in this example that by changing the grid shape we also changed the memref shape; we can always calculate the memref shape by plugging the full tensor dims into our affine map and then dividing by the grid shape.

(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
(2 - 1, 3 - 1, 64 - 1, 128 - 1) = (1 * 192 + 2 * 64 + 63, 127) = (383, 127)

Above we actually subtracted 1 in order to get the index of the last element of the tensor. Now we can simply add back 1 to get the size:

(383 + 1, 127 + 1) = (384, 128)

Finally, we divide the dims by the respective grid dims:

(384 / 2, 128 / 4) = (192, 32)

Here are a few more example MLIR snippets:

tensor<8x300xf32,
  #tt.metal_layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.metal_layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.metal_layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>

tensor<5x3x2x2x7x32x32xf32,
  #tt.metal_layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>

A couple of final notes regarding grid shape:

  • Grid shapes of rank > 2 are perfectly legal. Not only is this useful for describing multi-device grid topologies, but it is often convenient to have higher ranked grids to better describe how a high rank tensor should be divided. The grid shape here is a virtual grid shape; the tt.device attribute will hold an additional affine map that defines how this virtual grid shape maps to a physical one.
  • Grid shapes where either the number of rows or columns exceeds the physical device grid are also legal. Since this is only a virtual grid shape we could have some 1x64 grid that maps to a physical 8x8 device grid (this particular example is called width sharding in TTNN).

Tilized

A tilized tensor is one with a memref that has a tile element type.

Given some tensor with scalar layout:

tensor<3x64x128xf32,
  #tt.metal_layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<64x64xf32, #tt.memory_space<l1>>
  >
>

After tilizing we'll have:

tensor<3x64x128xf32,
  #tt.metal_layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Notice the memref dims were ceilDiv'd by the tile shape and the element type becomes a tt.tile type. Also notice that the tensor shape and element type remain intact.

Padding

Padding can be a bit of an overloaded term, but in this context it refers to an out of bounds area in the physical memory allocation that has no real tensor data in it. The contents of this area are tracked by oob_val and the padding area can be automatically derived from the attributes of tt.metal_layout.

Padding is a necessary evil that arises when a tensor is not evenly divisible by a grid shape or tile shape. It can also arise due to minimum Noc addressing requirements.

Example of non-divisible grid:

tensor<53x63xf32,
  #tt.metal_layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<18x32xf32, #tt.memory_space<l1>>
  >
>

The grid dims always ceilDiv the affine map results; real tensor data will entirely fill the initial shards and the last shard in each dimension will be partially filled.

In this particular example, we have 1 scalar row of padding on the last row of cores and 1 scalar column of padding on the last column of cores.

Taking the above example a step further, we could tilize it:

tensor<53x63xf32,
  #tt.metal_layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<1x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Tile dims also always ceilDiv the resulting memref shape. Notice now that the padding is slightly more complicated. Our scalar shard shape was 18x32, but this was further padded to 32x32 meaning that every core now has 14 rows of padding except for the last row of cores which has 15 rows of padding.

Also note that there is an order of operations here, grid divides the scalar shape first and then we tilize. This is important because it can enable use cases that frequently arise in conv networks that would otherwise result in reshards in between every layer.
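
To make the order of operations concrete, here is a rough sketch that reproduces the 53x63 example: the grid divides the scalar shape first, then tilization pads each shard up to a whole tile (helper names are illustrative):

def ceil_div(a, b):
    return -(-a // b)

tensor_shape, grid, tile = (53, 63), (3, 2), (32, 32)

# Step 1: the grid ceil-divides the scalar shape.
scalar_shard = tuple(ceil_div(t, g) for t, g in zip(tensor_shape, grid))      # (18, 32)
# Step 2: tilizing pads each shard up to a whole number of tiles.
padded_shard = tuple(ceil_div(s, t) * t for s, t in zip(scalar_shard, tile))  # (32, 32)

print(padded_shard[0] - scalar_shard[0])                          # 14 padding rows on most cores
print(padded_shard[0] - (tensor_shape[0] - 2 * scalar_shard[0]))  # 15 on the last row of cores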

With the affine map we can be even more flexible in how we pad; we can bump our stride between dimensions. Consider this tensor (with batch dim 2):

tensor<2x8x32xf32,
  #tt.metal_layout<
    (d0, d1, d2) -> (d0 * 8 + d1, d2),
    undef,
    <1x2>,
    memref<16x16xf32, #tt.memory_space<l1>>
  >
>

If we tilized the above tensor we'd end up with a memref shape of 1x1x!tt.tile<32x32>, that is, all batches are tightly packed within a single tile. Let's say that for some reason, we do not want the batches (2) to be tightly packed within a tile, perhaps the mathematical operation we're doing requires the batch to be independently evaluated and thus the (S)FPU needs them in separate tiles. We can adjust this by adjusting the stride of the affine map:

(d0, d1, d2) -> (d0 * 32 + d1, d2),

Instead of striding by the number of logical rows, 8, we bump the stride up to 32 effectively pushing a gap between the collapsed rows and enabling each batch to fall on a tile boundary.
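
A quick sketch of the effect, reusing the same shard arithmetic as earlier (the two lambdas encode the original and bumped strides; names are illustrative):

import math

# Sketch: shard shape in tiles for the 2x8x32 example, before and after bumping
# the stride from 8 (tightly packed) to 32 (each batch on a tile boundary).
def shard_in_tiles(tensor_shape, linear_map, grid, tile=(32, 32)):
    last = tuple(d - 1 for d in tensor_shape)
    physical = tuple(i + 1 for i in linear_map(last))
    scalar = tuple(math.ceil(p / g) for p, g in zip(physical, grid))
    return tuple(math.ceil(s / t) for s, t in zip(scalar, tile))

packed = lambda i: (i[0] * 8 + i[1], i[2])   # stride by the logical row count
spaced = lambda i: (i[0] * 32 + i[1], i[2])  # stride bumped to 32

assert shard_in_tiles((2, 8, 32), packed, (1, 2)) == (1, 1)  # batches share one tile
assert shard_in_tiles((2, 8, 32), spaced, (1, 2)) == (2, 1)  # one tile row per batch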

Memory Spaces

At the time of writing this document there are 4 memory spaces:

  1. System: Host memory space that is not device visible.
  2. SystemMMIO: Host memory space that is device visible.
  3. DeviceDRAM: DRAM local to the device.
  4. DeviceL1: SRAM on each core.

Something worth noting here is that a tensor must belong exclusively to only one of these memory spaces at a time. For example, in order to stream tensor data from DeviceDRAM to DeviceL1 you would need to either manually slice the tensor into smaller tensors that do fit in L1 or have native support in the op's kernel for double buffering a block (most TTNN ops already support this).

Multi-device

Multi-device can be naturally represented via a combination of two concepts already touched on above, higher ranked grids and collapseIntervals. Let's consider the following example with a 3d grid and collapseIntervals=[(1, -1)].

tensor<2x3x64x128xf32,
  #tt.metal_layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
    undef,
    <2x2x4>,
    memref<1x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Here we've left the batch dim intact and started collapsing at d1. This enables us to define a 3d grid where the outermost grid dim divides the batch directly. This could map to a 2 device system where the batch dim is evenly divided between 2 devices. Within each device this op runs on a 2x4 grid.

The high level takeaway here is that how a tensor is logically divided up is decoupled from its mapping to physical compute resources. This has a nice property that data parallel extends to any tensor dimension and is captured under the same grid primitive that also divides tensor rows and columns.

Test Plan

  • test/python/tensor_layout.py: Assertions for LayoutAttr to make sure it's spec compliant.
  • Sweep tests:
    • Grid dim sweeps
    • Tilize / untilize sweeps
    • Padding sweeps
  • Multi-device tests

Concerns

  • tt.metal_layout is deliberately flexible and tries to capture as many problematic use-cases as we've run into in the past in a single, succinct representation. This flexibility will need to be further constrained by backends to avoid unsupported programming of this attribute.
  • Optimization solution space is potentially large with all of this flexibility. Two things that I hope can help protect us here:
    • By and large the heuristic we'll be following is just max the grid at all costs. This should really narrow down the solution space to only a handful of options and we only keep exploring if producers/consumers end up with nasty reblocking.
    • We can constrain the optimizer heuristics as aggressively as possible in the beginning and just advertise the full flexible options to the UI model explorer. Hopefully this enables us to experiment with crazier grid layouts and prove it's worthwhile before writing an algorithm.







TTNN Tensor Layout

The above section of this document covers how the compiler models tensor layout. There are some slight differences in TTNN, but the high level idea of collapsing dims is still used.

Terms

  • shape: Always logical shape, n-dimensional
  • stride: Same as pytorch stride, but this is crucial for describing how n-dimensional data gets packed into a 2D physical layout. This 2D physical layout is always the inner dim (-1) wide and dims [0, N-1] are collapsed into rows derived from stride
  • shard_shape: Also a logical shape, describes a 2d region that chunks physical_shape. Note this does not need to be a tile multiple
  • physical_shard_shape: The shard_shape padded out to tile_shape
  • tile_shape: A programmable tile shape, though constraints must check that it's compatible with an op's usage, i.e. FPU/Noc compatible
  • grid_shape: [divup(stride[0] // stride[-2], shard_shape[0]), divup(stride[-2], shard_shape[0])]

Mapping from the compiler

The compiler uses an affine map to explicitly track which dimensions are folded together, but TTNN does not have affine maps so the representation is a bit more implicit. TTNN captures the dimension collapsing in the stride attribute where dimensions [0, N-1] are always collapsed. This is less flexible so the compiler will have to enforce only collapsing supported dimensions when targeting TTNN, or handle lowering in a different way. For example, in the compiler we might want to represent data parallel over the tensor batch dim by leaving d0 and collapsing d1 - d[-1]. TTNN doesn't support this in its tensor layout representation, but this could be lowered to a TTNN mesh tensor where the mesh could be sliced on the batch and each per-device tensor has d0 fully collapsed.
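
For intuition, here is a small sketch of a PyTorch-style contiguous stride for a logical shape, and the 2D physical shape it implies when the leading dims collapse into rows and the innermost dim stays as the width (purely illustrative, not the TTNN API):

import math

def contiguous_stride(shape):
    # Contiguous (PyTorch-style) strides, innermost dim has stride 1.
    stride, acc = [], 1
    for d in reversed(shape):
        stride.insert(0, acc)
        acc *= d
    return stride

shape = (2, 3, 64, 128)
print(contiguous_stride(shape))            # [24576, 8192, 128, 1]
print((math.prod(shape[:-1]), shape[-1]))  # (384, 128) collapsed 2D physical shape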

TTNN Example

(Figure: TTNN tensor layout example.)

Device

Device in tt-mlir is somewhat of an overloaded term and can refer to different things depending on the context. This document will only speak to the compiler's abstract representation of a device captured by attribute #tt.device.

Terms

There are many overloaded terms when talking about devices and grids, this document will use the following definitions:

  • Physical Grid: A 2D array of tensix cores on a chip.
  • Chip: A single physical chip with a Physical Grid of cores.
  • Card: A PCIE or Ethernet card that may contain multiple Chips.
  • System: A collection of Cards that are usually connected together on the same host via PCIE or networked via ethernet. A system is represented by SystemDesc in the compiler.
  • Device: Device is always presented as a single entity to the enclosing scope, but it may be virtualized to abstract a multi-card System and part of its encoding carries a Logical Grid. Another way to think of device is a view over the system.
  • Logical Grid or just Grid: Is a logical shape that abstracts one or more Physical Grids.
  • Mesh Shape: Describes the virtual layout of the chips with respect to each other. In practice the mesh shape is used to derive the logical grid.

Motivation

The device attribute strives to achieve the following goals:

  • Provide a convenient representation of a physical grid that decouples the logical division of tensors from the physical layout of the hardware. This not only simplifies reasoning about how tensors get divided into shards, but can also enable reinterpretations of the device grid for data layout optimization decoupled from the existing encoding of the tensor layouts.
  • Following the first point, the device attribute should be able to represent many different forms of logical grids, from simple 2D grids, to more complex topologies like extra-wide grids or higher dimensional grids.
  • Device attribute captures encoding both single chip and multi-chip systems under a single, virtualized representation.
  • Enable many forms of data parallel execution strategies for single and multi chip systems under a single representation.

Scope

This document will cover how the device attribute is encoded and how it can be lowered to backend dialects. The document will not cover the algorithm for choosing the best, or even legal, device configurations for a given physical system.

Examples

All of the following examples will assume the physical hardware has an 8x8 physical grid of cores. We will use notation [N, 8x8] to represent a N chip system, each with an 8x8 physical grid.

#tt.device in its simplest, single chip form [1, 8x8] just maps directly 1-1 to the underlying physical hardware device.

#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
  meshShape = 1,
  chipIds = [0]
>

Let's break down what each of these attributes mean:

  • workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>: This is a 2D logical grid with dim 8x8. It's followed by an affine map (d0, d1) -> (0, d0, d1) that provides a mapping from the logical grid to the physical grid. In this case, the logical grid is the same as the physical grid, so the mapping is the identity function. The logical grid can have any rank, but the physical mapping is always 3D, with the first being the chip index, followed by the 2D physical core index within the chip.
  • meshShape = 1: A shape provided as part of the DeviceAttr constructor that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
  • chipIds = [0]: This is a list of chip indices. These chip indices directly reference the same chip indices in the system descriptor. The SystemDesc attribute that this is in reference to is tagged on the top level ModuleOp.

Specific examples that this document will cover:

  • Data Parallel Over Batch
  • Data Parallel Over 2d
  • Data Parallel Over 2d and Batch
  • Pipeline Parallel
  • Reinterpreted Grids

Before we move on to more complex examples, it's worth having on hand:

  • The python test test/python/device_attr.py which shows how all of these examples can actually be programmed for the device attribute.
  • The Tensor Layout spec as the following examples will demonstrate how tensor layout interacts with the logical device grid.

Note on Data Parallel: There is existing literature that explicitly distinguishes between data parallel and tensor parallel, oftentimes describing data parallel as duplicating the model across multiple devices and trivially dividing up the batch whereas tensor parallel refers to tensor data being distributed and potentially communicated between devices during execution. While this is true for multi-GPU/CPU systems, it is somewhat of an implementation detail and given the flexibility of tenstorrent hardware there is an opportunity to generalize this concept. In this document we will use the term data parallel to refer to any form of parallelism that divides any dimension of the tensor across multiple cores/chips.

Note on Constraints: Many of the examples below require careful virtualization of the underlying physical system, i.e. some device configurations might only work if the chips are connected via ethernet and with a particular topology, but these constraints are outside the scope of the examples and will be discussed further in the Backend Lowering and Constraints section.

Data Parallel Over Batch

Given a 2 chip system, [2, 8x8], we can represent a simple data parallel logical grid that divides the batch dimension in half across the two chips. This is denoted by meshShape = 2x1x1 which means the logical grid is 3D.

#tt.device<
  workerGrid = #tt.grid<2x8x8, (d0, d1, d2) -> (d0, d1, d2)>,
  meshShape = 2x1x1,
  chipIds = [0, 1]
>

The affine map here is just identity, so dims d1 and d2 directly index the physical grid and d0 indexes the chip.

Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:

tensor<16x3x64x128xf32,
  #tt.metal_layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
    undef,
    <2x2x4>,
    memref<8x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

If we map this tensor onto the above device, it will span across both chips, half of the batch dimension on each chip. Within each chip the tensor occupies a 2x4 grid out of the 8x8 physical grid available.

Data Parallel Over 2d

In this example we will consider a 2 chip system, [2, 8x8], and view it as though the two chips are concatenated together side by side to form a single 8x16 grid. This is denoted by meshShape = 1x2 which means to concatenate the chips in the second dimension.

#tt.device<
  workerGrid = #tt.grid<8x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8)>,
  meshShape = 1x2,
  chipIds = [0, 1]
>

Here we can see that the affine map encodes an indexing pattern such that when we extend past 8 cores in the second dimension, we wrap around to the next chip.
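
To check the wrap-around behavior, here is a tiny sketch that evaluates this map at a few logical coordinates (purely illustrative):

# Sketch: the 1x2 mesh worker grid map, (d0, d1) -> (chip, physical row, physical col).
def mesh_1x2(d0, d1):
    return ((d0 // 8) * 2 + d1 // 8, d0, d1 % 8)

assert mesh_1x2(0, 0) == (0, 0, 0)   # top-left lands on chip 0
assert mesh_1x2(0, 8) == (1, 0, 0)   # logical column 8 wraps onto chip 1
assert mesh_1x2(7, 15) == (1, 7, 7)  # bottom-right of the logical 8x16 grid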

Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:

tensor<256x1024xf32,
  #tt.metal_layout<(d0, d1) -> (d0, d1),
    undef,
    <4x16>,
    memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

This single tensor maps trivially onto the logical grid, spanning the upper half. Decoupled from the tensor's layout, under the hood the tensor is actually physically spanning across two chips.

Data Parallel Over 2d and Batch

The previous 2 examples can be composed together to form a logical grid that divides tensor across multiple dimensions. Here we will consider a 4 chip system [4, 8x8] and view it as a 2x8x16 grid. Note that the meshShape is 2x1x2 which means to concatenate the chips in the first and third dimensions.

#tt.device<
  workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
  meshShape = 2x1x2,
  chipIds = [0, 1, 2, 3]
>

We can evaluate the affine map to see that the chips are interpreted in chunks of two, where groups [0, 1] and [2, 3] each form 8x16 grids and these 2 groups concatenate to form a 2x8x16 grid.
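
Evaluating the map at a few representative coordinates confirms this grouping (again, just a sketch):

# Sketch: (d0, d1, d2) -> (chip, physical row, physical col) for the 2x1x2 mesh.
def mesh_2x1x2(d0, d1, d2):
    return (d0 * 2 + (d1 // 8) * 2 + d2 // 8, d1, d2 % 8)

assert mesh_2x1x2(0, 0, 0)[0] == 0  # first 8x16 group spans chips 0 and 1
assert mesh_2x1x2(0, 0, 8)[0] == 1
assert mesh_2x1x2(1, 0, 0)[0] == 2  # second 8x16 group spans chips 2 and 3
assert mesh_2x1x2(1, 0, 8)[0] == 3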

We can consider the following tensor to map onto this grid:

tensor<64x256x1024xf32,
  #tt.metal_layout<(d0, d1, d2) -> (d0, d1, d2),
    undef,
    <2x4x16>,
    memref<32x2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Pipeline Parallel

Pipeline parallel in the scope of this spec isn't particularly interesting; it is intended to be used in conjunction with the ttir.pipeline operation, which will group sections of the module's operations into pipeline regions and will be covered in a separate spec.

What we can demonstrate here is how we can take multiple non-overlapping views of the system descriptor to form distinct virtual devices.

Given an 8 chip system [8, 8x8], we can form two virtual devices that each take 4 chips and interpret them differently (though they could take the same logical grid).

#tt.device<
  workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
  meshShape = 2x1x2,
  chipIds = [0, 1, 2, 3]
>
#tt.device<
  workerGrid = #tt.grid<16x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0 mod 8, d1 mod 8)>,
  meshShape = 2x2,
  chipIds = [4, 5, 6, 7]
>

Reinterpreted Grids (Transpose)

One particularly interesting usecase that logical grids could enable is to reinterpret the grid as a form of data layout optimization. For example, if we wanted to transpose a tensor, instead of having to move the data around to implement transpose, we could instead reinterpret the grid as being transposed, leveraging the fact that the relevant data is already located on the correct cores/chips.

To keep things simple, let's consider a 1 chip system [1, 8x8], but it's not too big a leap to see how this could map to multi-chip where the cost of moving data is even higher.

Let's also consider a simple (totally contrived) eltwise unary graph:

a = exp(a)
aT = transpose(a)
relu(aT)
  1. We'll establish a regular, single chip, identity logical grid:
#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
  meshShape = 1,
  chipIds = [0]
>
  2. Execute exp.
  3. We'll reinterpret the grid as transposed:
#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d1, d0)>,
  meshShape = 1,
  chipIds = [0]
>
  4. Execute transpose. Note that each core only needs to transpose its data locally. Eventually this could be implemented as a no-op by reindexing the tile visitation order of the successive operation.
  5. Execute relu.

It's important to note that we effectively implemented transpose without moving data anywhere.

Reinterpreted Grids (Extra)

For the sake of examples, here's a few more ways of reinterpreting the logical grid.

Extra Wide Grid

#tt.device<
  workerGrid = #tt.grid<1x64, (d0, d1) -> (0, d0 * 8 + d1 floordiv 8, d1 mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

Extra Tall + Transposed Grid

#tt.device<
  workerGrid = #tt.grid<64x1, (d0, d1) -> (0, d1 * 8 + d0 floordiv 8, d0 mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

Staircase

#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, (d0 + d1) mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

This could be an interesting starting position for data in implementing matmul as a systolic array in a ring topology.

Lowering to TTNN

While the above device attribute encoding is quite flexible, this does not necessarily mean the target backend can actually support all of these interpretations. The TTNN backend will be constrained to support only the specialized grid topologies that are supported by its API.

Grid/Shard Orientation

TODO

Multi-device

Please refer to TTNN Mesh Programming Docs for more information on how to program multi-device systems with TTNN API.

The multi-device TTNN dialect will try to stay as close to the TTNN API as possible. Let's consider what this looks like from the compiler and runtime perspectives:

Compiler

  • Device Creation: The TTNN device in the compiler is exactly the same attribute from the ttir dialect. It will encode the meshShape into the flatbuffer which can be directly used to program ::ttnn::MeshShape.
  • Tensor Layout: Again, the tensor layout is inherited in TTNN dialect from the ttir dialect. The grid attribute in the tensor layout can be trivially divided by meshShape to determine the shape of the tensor slice on each device. Broadcasting rules can be applied to determine which Distribution Strategy to use:
    • Mesh Sharded: If the tensor grid is > 1 along the meshShape dimensions, the tensor will be sharded across the mesh devices.
    • Replication: If the tensor needs to be broadcasted for this op, by extension the tensor layout will be replicated across the mesh devices.

Runtime

  • Device Creation: The ttnn runtime will wholesale switch to working with mesh devices via the API ttnn::multi_device::open_mesh_device; this is possible because a 1x1 mesh device is a valid single device. The mesh shape during device open will always be 1xN where N is the number of deviceIds in the array. Note that this shape can be reinterpreted by flatbuffer programs on the fly with the SubMesh API.
  • Tensor Creation: Tensor creation in a multi-device system is a bit more involved. In order to upload a multi-device tensor to the mesh, the host tensor must first be created with MultiDeviceHostStorage. The ttnn runtime can automatically do this during handleToHostMemoryConfigOp:
    • A regular host tensor will bounce through a new tensor with MultiDeviceHostStorage type.
    • tensor.to(mesh_device) will allocate/move the tensor to the mesh device.

Lowering to TTMetal

In TTMetal dialect we are only constrained by what we've implemented in the tt-mlir compiler, this means it is much more flexible and can theoretically support any of the grid interpretations above.

Test Plan

  • test/python/device_attr.py covers all of the examples above and asserts the IR is correctly generated.
  • Additional functional unit tests will be added as op and runtime support is added.

Concerns

  • tt.device is very flexible, but with this flexibility comes the potential for misuse. It's important that the compiler is able to validate the legal configurations of this attribute for the target backend.

'tt' Dialect

TT types and attributes common to all TT dialects.

This dialect defines types and attributes common to all TT dialects.

[TOC]

ArchAttr

TT Arch

Syntax:

#tt.arch<
  ::mlir::tt::Arch   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::Arch | an enum of type Arch

ArgumentAllocationAttr

Argument allocation attribute in TT dialect

Syntax:

#tt.arg_alloc<
  uint64_t,   # address
  uint64_t,   # size
  MemorySpace   # memorySpace
>

Holds the metadata for the allocation of a function argument, i.e. for graph inputs.

Parameters:

Parameter | C++ type | Description
address | uint64_t
size | uint64_t
memorySpace | MemorySpace

ArgumentTypeAttr

Argument Type

Syntax:

#tt.argument_type<
  ::mlir::tt::ArgumentType   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::ArgumentType | an enum of type ArgumentType

CPUDescAttr

TT cpu_desc attribute

Syntax:

#tt.cpu_desc<
  CPURole,   # role
  StringAttr   # target_triple
>

TT cpu_desc attribute

Parameters:

Parameter | C++ type | Description
role | CPURole
target_triple | StringAttr

CPURoleAttr

TT CPU Role

Syntax:

#tt.cpu_role<
  ::mlir::tt::CPURole   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::CPURole | an enum of type CPURole

ChipChannelAttr

TT chip_channel attribute

Syntax:

#tt.chip_channel<
  unsigned,   # deviceId0
  ::llvm::ArrayRef<int64_t>,   # ethernetCoreCoord0
  unsigned,   # deviceId1
  ::llvm::ArrayRef<int64_t>   # ethernetCoreCoord1
>

TT chip_channel attribute

Parameters:

Parameter | C++ type | Description
deviceId0 | unsigned
ethernetCoreCoord0 | ::llvm::ArrayRef<int64_t>
deviceId1 | unsigned
ethernetCoreCoord1 | ::llvm::ArrayRef<int64_t>

ChipCoordAttr

TT chip_coord attribute

Syntax:

#tt.chip_coord<
  unsigned,   # rack
  unsigned,   # shelf
  unsigned,   # y
  unsigned   # x
>

TT chip_coord attribute

Parameters:

Parameter | C++ type | Description
rack | unsigned
shelf | unsigned
y | unsigned
x | unsigned

ChipDescAttr

TT chip_desc attribute

Syntax:

#tt.chip_desc<
  ArchAttr,   # arch
  ::llvm::ArrayRef<int64_t>,   # grid
  ::llvm::ArrayRef<int64_t>,   # coordTranslationOffsets
  unsigned,   # l1Size
  unsigned,   # numDramChannels
  unsigned,   # dramChannelSize
  unsigned,   # nocL1AddressAlignBytes
  unsigned,   # pcieAddressAlignBytes
  unsigned,   # nocDRAMAddressAlignBytes
  unsigned,   # l1UnreservedBase
  unsigned,   # eriscL1UnreservedBase
  unsigned,   # dramUnreservedBase
  unsigned,   # dramUnreservedEnd
  ChipPhysicalHelperCoresAttr,   # chipPhysicalHelperCores
  ::llvm::ArrayRef<DataTypeAttr>,   # supportedDataTypes
  ::llvm::ArrayRef<TileSizeAttr>,   # supportedTileSizes
  unsigned,   # dstRegisterSizeTiles
  unsigned,   # numCBs
  unsigned,   # numComputeThreads
  unsigned   # numDatamovementThreads
>

TT chip_desc attribute

Parameters:

Parameter | C++ type | Description
arch | ArchAttr
grid | ::llvm::ArrayRef<int64_t>
coordTranslationOffsets | ::llvm::ArrayRef<int64_t>
l1Size | unsigned
numDramChannels | unsigned
dramChannelSize | unsigned
nocL1AddressAlignBytes | unsigned
pcieAddressAlignBytes | unsigned
nocDRAMAddressAlignBytes | unsigned
l1UnreservedBase | unsigned
eriscL1UnreservedBase | unsigned
dramUnreservedBase | unsigned
dramUnreservedEnd | unsigned
chipPhysicalHelperCores | ChipPhysicalHelperCoresAttr
supportedDataTypes | ::llvm::ArrayRef<DataTypeAttr>
supportedTileSizes | ::llvm::ArrayRef<TileSizeAttr>
dstRegisterSizeTiles | unsigned
numCBs | unsigned
numComputeThreads | unsigned
numDatamovementThreads | unsigned

ChipPhysicalHelperCoresAttr

TT chip_physical_helper_cores attribute

Syntax:

#tt.chip_physical_helper_cores<
  ::llvm::ArrayRef<CoreCoordAttr>,   # dram
  ::llvm::ArrayRef<CoreCoordAttr>,   # eth
  ::llvm::ArrayRef<CoreCoordAttr>   # eth_inactive
>

TT chip_physical_helper_cores attribute containing arrays of physical helper cores by core type in order of logical cores.

Parameters:

Parameter | C++ type | Description
dram | ::llvm::ArrayRef<CoreCoordAttr>
eth | ::llvm::ArrayRef<CoreCoordAttr>
eth_inactive | ::llvm::ArrayRef<CoreCoordAttr>

CoreCoordAttr

TT core_coord attribute

Syntax:

#tt.core_coord<
  int64_t,   # y
  int64_t   # x
>

TT core_coord attribute containing a single physical core coordinate.

Parameters:

Parameter | C++ type | Description
y | int64_t
x | int64_t

DataTypeAttr

TT DataTypes

Syntax:

#tt.supportedDataTypes<
  ::mlir::tt::DataType   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::DataType | an enum of type DataType

DeviceAttr

Device attribute in TT dialect.

Syntax:

#tt.device<
  ::mlir::tt::GridAttr,   # workerGrid
  AffineMap,   # l1Map
  AffineMap,   # dramMap
  ::llvm::ArrayRef<int64_t>,   # meshShape
  ::llvm::ArrayRef<unsigned>   # chipIds
>

Describes the physical layout of a device in the system and is made up of a few components:

  • A grid attribute that describes the device's compute grid shape. It not only describes the shape of the compute grid, but also carries an affine map that describes how the logical grid maps to the physical grid.
  • Two affine maps that describe how a tensor layout's linear attribute maps to the L1 and DRAM memory spaces.
  • A mesh shape that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
  • An array of chip ids that this device is made up of. This array's length must match the volume of the mesh shape and should be interpreted in row-major order.

Parameters:

Parameter | C++ type | Description
workerGrid | ::mlir::tt::GridAttr | TT grid attribute
l1Map | AffineMap
dramMap | AffineMap
meshShape | ::llvm::ArrayRef<int64_t>
chipIds | ::llvm::ArrayRef<unsigned>

GridAttr

TT grid attribute

Syntax:

#tt.grid<
  ::llvm::ArrayRef<int64_t>,   # shape
  AffineMap   # mapping
>

TT grid attribute

Parameters:

Parameter | C++ type | Description
shape | ::llvm::ArrayRef<int64_t>
mapping | AffineMap

IteratorTypeAttr

TT IteratorType

Syntax:

#tt.iterator_type<
  ::mlir::tt::IteratorType   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::IteratorType | an enum of type IteratorType

MemorySpaceAttr

TT MemorySpace

Syntax:

#tt.memory_space<
  ::mlir::tt::MemorySpace   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::MemorySpace | an enum of type MemorySpace

MeshAttr

Mesh reference attribute in TT dialect.

Syntax:

#tt.mesh<
  StringAttr,   # name
  ::llvm::ArrayRef<int64_t>   # shape
>

Describes a mesh config including name and shape.

Parameters:

Parameter | C++ type | Description
name | StringAttr
shape | ::llvm::ArrayRef<int64_t>

MeshShardDirectionAttr

TT MeshShardDirection

Syntax:

#tt.shard_direction<
  ::mlir::tt::MeshShardDirection   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::MeshShardDirection | an enum of type MeshShardDirection

MeshShardTypeAttr

MeshShard shard_type attribute in TT dialect

Syntax:

#tt.shard_type<
  ::mlir::tt::MeshShardType   # value
>

Define sharded tensor data of mesh_shard op.

  • Identity: input and output tensors are pre-sharded (same data) and no sharding is required.
  • Replicate: all of the devices have the full tensor (same data).
  • Maximal: one device or a subset of the devices has the full tensor (same data).
  • Devices: all or a subset of the devices have a sharded (partial) tensor (different data).

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::MeshShardType | an enum of type MeshShardType

MeshesAttr

TT system meshes attribute.

Syntax:

#tt.meshes<
  ::llvm::ArrayRef<MeshAttr>   # meshes
>

TT system meshes attribute includes one or more mesh configs used for networks.

Parameters:

Parameter | C++ type | Description
meshes | ::llvm::ArrayRef<MeshAttr>

MetalLayoutAttr

Tensor layout attribute

Syntax:

#tt.metal_layout<
  AffineMap,   # linear
  OOBVal,   # oob_val
  GridAttr,   # grid
  MemRefType   # memref
>

The tensor layout attribute captures how tensor data is sharded across a grid of devices, cores, and is laid out in memory.

Some high level goals

  • Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
  • Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
  • Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
  • Explicit: A single source of truth.
  • Enable a direct way to query padded regions.

Please refer to the Tensor Layout Spec for more in depth documentation.

Examples:

tensor<8x300xf32,
  #tt.metal_layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.metal_layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.metal_layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>

tensor<5x3x2x2x7x32x32xf32,
  #tt.metal_layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>

Parameters:

Parameter | C++ type | Description
linear | AffineMap | An affine map that defines how the logical tensor dimensions map to a grid shape.
oob_val | OOBVal | A tracked out of bounds value that fills padding space.
grid | GridAttr | The grid shape that this tensor is divided onto.
memref | MemRefType | A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid.

OOBValAttr

TT OOBVal

Syntax:

#tt.oob_val<
  ::mlir::tt::OOBVal   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::OOBVal | an enum of type OOBVal

ReduceTypeAttr

TT Reduce Type

Syntax:

#tt.reduce_type<
  ::mlir::tt::ReduceType   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::ReduceType | an enum of type ReduceType

ShardLayoutAttr

Shard layout attribute in TT dialect

Syntax:

#tt.shard<
  ::llvm::ArrayRef<int64_t>,   # stride
  uint32_t   # buffers
>

Describes shard layout of a memref buffer.

  • Stride: Stride of each dim in bytes.
  • Buffers: Number of back buffers used for double buffering, I/O latency hiding, etc

The shard layout attribute is a description of how each shard of a memref is laid out in memory. Memrefs with this layout type implicitly mean their data is distributed across a grid of cores.

Parameters:

Parameter | C++ type | Description
stride | ::llvm::ArrayRef<int64_t>
buffers | uint32_t

SystemDescAttr

TT system_desc attribute

Syntax:

#tt.system_desc<
  ::llvm::ArrayRef<CPUDescAttr>,   # cpuDescs
  ::llvm::ArrayRef<ChipDescAttr>,   # chipDescs
  ::llvm::ArrayRef<unsigned>,   # chipDescIndices
  ::llvm::ArrayRef<ChipCapabilityAttr>,   # chipCapabilities
  ::llvm::ArrayRef<ChipCoordAttr>,   # chipCoords
  ::llvm::ArrayRef<ChipChannelAttr>   # chipChannels
>

TT system_desc attribute

Parameters:

Parameter | C++ type | Description
cpuDescs | ::llvm::ArrayRef<CPUDescAttr>
chipDescs | ::llvm::ArrayRef<ChipDescAttr>
chipDescIndices | ::llvm::ArrayRef<unsigned>
chipCapabilities | ::llvm::ArrayRef<ChipCapabilityAttr>
chipCoords | ::llvm::ArrayRef<ChipCoordAttr>
chipChannels | ::llvm::ArrayRef<ChipChannelAttr>

TensorMeshShardingAttr

Tensor mesh sharding attribute in TT dialect.

Syntax:

#tt.mesh_sharding<
  StringAttr,   # name
  ::llvm::ArrayRef<TensorMeshShardingAxisAttr>   # tensor_mesh_sharding_axis
>

Describes a tensor's multi-device status.

  • Single device tensors have no TensorMeshShardingAttr, e.g. tensor<784x16384xf32>.

  • Multi-device tensors have a TensorMeshShardingAttr:

    (i) A multi-device tensor without tensor mesh shard axes indicates that all devices in "mesh" have the full size tensor, e.g. 784x16384 for tensor<784x16384xf32, #tt.mesh_sharding<"mesh">>.

    (ii) A multi-device tensor with tensor mesh shard axes indicates that all devices in "mesh" have a sharded tensor defined by the TensorMeshShardingAxisAttr, e.g. 196x16384 for tensor<784x16384xf32, #tt.mesh_sharding<"mesh", [ 4(1), 1]>>. Here, 4(1) indicates shard_shape(shard_dim), so 784 is sharded by 4 at "mesh"'s second hardware dimension. 1 indicates no sharding, so 16384 is not sharded.

Parameters:

Parameter | C++ type | Description
name | StringAttr
tensor_mesh_sharding_axis | ::llvm::ArrayRef<TensorMeshShardingAxisAttr>

TensorMeshShardingAxisAttr

Tensor mesh sharding axis info attribute in TT dialect.

Syntax:

#tt.tensor_sharding<
  int64_t,   # shard_shape
  ::llvm::ArrayRef<int64_t>   # axes
>

Details per tensor dimension sharding and axes info.

  • shard_shape: shard shape at a tensor dimension.
  • (optional) axes: mesh shard dimensions. Axes may be empty if it is not being sharded.

Parameters:

Parameter | C++ type | Description
shard_shape | int64_t
axes | ::llvm::ArrayRef<int64_t>

TileSizeAttr

TT tile_size attribute

Syntax:

#tt.tile_size<
  int64_t,   # y
  int64_t   # x
>

TT tile_size attribute containing a supported Tensix tile shape.

Parameters:

Parameter | C++ type | Description
y | int64_t
x | int64_t

ViewLayoutAttr

View layout attribute in TT dialect

Syntax:

#tt.view<
  AffineMap   # affineMap
>

Describes a view layout of a memref buffer.

  • AffineMap: Provides affine map indexing into the associated data view.

Only the view_layout or stream_layout ops should return memrefs with this attribute. The view layout attribute is necessary for two reasons:

  • It provides a way to reblock the data view into a different shape (via affine map). Usually this would be some subblock of the original backing memory to chunk the data into smaller pieces.
  • The type itself is a signal to datamovement passes that the memref is a view and should be treated as such.

Parameters:

Parameter | C++ type | Description
affineMap | AffineMap

tt.cpu_module (tt::CPUModuleOp)

Module-wrapper operation for CPU ops

Syntax:

operation ::= `tt.cpu_module` attr-dict-with-keyword regions

Custom module operation that contains a single ModuleOp, which should contain all funcs which should be run on CPU.

Example:

tt.cpu_module {
  module {
    func.func @foo() { ... }
  }
}

Traits: IsolatedFromAbove, NoRegionArguments, NoTerminator, SingleBlock, SymbolTable

tt.device_module (tt::DeviceModuleOp)

Module-wrapper operation for device ops

Syntax:

operation ::= `tt.device_module` attr-dict-with-keyword $bodyRegion

Custom module operation that contains a single ModuleOp, which should contain all funcs which should be run on device.

Example:

tt.device_module {
  module {
    func.func @foo() { ... }
  }
}

Traits: IsolatedFromAbove, NoRegionArguments, NoTerminator, SingleBlock, SymbolTable

tt.device (tt::DeviceOp)

Named device

Syntax:

operation ::= `tt.device` $sym_name `=` $device_attr attr-dict

Interfaces: Symbol

Attributes:

Attribute | MLIR Type | Description
sym_name | ::mlir::StringAttr | string attribute
device_attr | ::mlir::tt::DeviceAttr | Device attribute in TT dialect (see DeviceAttr above).

tt.get_tuple_element (tt::GetTupleElementOp)

GetTupleElement operation

Syntax:

operation ::= `tt.get_tuple_element` $operand `[` $index `]` attr-dict `:` functional-type(operands, results)

Extracts element at index position of the operand tuple and produces a result.

Example:

%result = tt.get_tuple_element %operand[0] : (tuple<tensor<32x32xbf16>, tensor<1x32xf32>>) -> tensor<32x32xbf16>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
index | ::mlir::IntegerAttr | 32-bit signless integer attribute whose value is non-negative

Operands:

Operand | Description
operand | nested tuple with any combination of ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

tt.load_cached (tt::LoadCachedOp)

Load cached results from a previously computed function

Syntax:

operation ::= `tt.load_cached` `(` $callee `,` `[` $inputs `]` `)` attr-dict `:` functional-type($inputs, $results)

The load_cached operation calls a precomputed function with given arguments and returns its results. This is typically used to load constant or hoisted computation results.

Example:

%0, %1, %2 = "tt.load_cached"(@forward_const_eval_1, [%arg0, %arg2])

Attributes:

Attribute | MLIR Type | Description
callee | ::mlir::FlatSymbolRefAttr | flat symbol reference attribute

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values

Results:

Result | Description
results | variadic of ranked tensor of any type values

tt.tuple (tt::TupleOp)

Tuple operation

Syntax:

operation ::= `tt.tuple` $operands attr-dict `:` custom<TupleOpType>(type($operands), type($result))

Produces a result tuple from its operands.

Example:

%result = tt.tuple %operand0, %operand1 : tuple<tensor<32xbf16>, tensor<1x32xf32>>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
operands | variadic of ranked tensor of any type values

Results:

Result | Description
result | nested tuple with any combination of ranked tensor of any type values

TileType

TT tile

Syntax:

!tt.tile<
  ::llvm::ArrayRef<int64_t>,   # shape
  DataType   # dataType
>

Tile type in TT dialect

Parameters:

Parameter | C++ type | Description
shape | ::llvm::ArrayRef<int64_t>
dataType | DataType

'ttir' Dialect

TTIR dialect provides high level semantics for dispatching work to TT HW.

This dialect provides high level semantics for dispatching work to TT HW. It defines a set of declarative/high level operations that are used to describe the dispatch, but is largely agnostic to the set of operations or dialects that are actually supported by a consuming backend.

[TOC]

ttir.abs (tt::ttir::AbsOp)

Elementwise absolute value operation.

The abs operation computes the absolute value of each element in the input tensor.

For each element, it returns the magnitude of the value without regard to its sign:

  • For real numbers, it returns |x| (the non-negative value without sign)

This operation has the idempotence property, meaning that applying it multiple times produces the same result as applying it once: abs(abs(x)) = abs(x). The operation preserves the data type of the input.

Example:

// Compute absolute values of all elements in %input
%result = ttir.abs(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[-2.5,  3.7,  0.0,  1.2], ... ]
// Output tensor:
// [[2.5, 3.7, 0.0, 1.2], ... ]

// Example with integer tensor
%result = ttir.abs(%int_input, %int_output) : tensor<10xi32>, tensor<10xi32> -> tensor<10xi32>
// Input tensor:
// [-5, 0, 3, -2, ...]
// Output tensor:
// [5, 0, 3, 2, ...]

Mathematical definition: abs(x) = |x| = { x if x ≥ 0; -x if x < 0 }

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TTIR_Idempotence, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.add (tt::ttir::AddOp)

Elementwise addition operation.

The add operation performs an elementwise addition between two tensors.

For each pair of corresponding elements, it adds the elements and places the result in the output tensor.

Example:

// Addition operation
%result = ttir.add(%lhs, %rhs, %output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi32> -> tensor<3xi32>
// Input tensors:
// %lhs: [10, 20, 30]
// %rhs: [1, 2, 3]
// Output tensor:
// [11, 22, 33]

// Example with floating point values
%result = ttir.add(%float_lhs, %float_rhs, %float_output) : tensor<3xf32>, tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensors:
// %float_lhs: [3.5, 0.0, -1.2]
// %float_rhs: [1.5, 2.0, -3.2]
// Output tensor:
// [5.0, 2.0, -2.0]

Note: The data type of the output tensor matches the data type of the input tensors.

Mathematical definition: add(x, y) = x + y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary, TTIR_FullyBroadcastable

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
lhs | ranked tensor of any type values
rhs | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.all_gather (tt::ttir::AllGatherOp)

All gather operation.

All gather op.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
all_gather_dim | ::mlir::IntegerAttr | 32-bit signed integer attribute
cluster_axis | ::mlir::IntegerAttr | 32-bit unsigned integer attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.all_reduce (tt::ttir::AllReduceOp)

AllReduce operation.

AllReduce op.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
reduce_type | ::mlir::tt::ReduceTypeAttr | TT Reduce Type
cluster_axis | ::mlir::IntegerAttr | 32-bit unsigned integer attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.alloc (tt::ttir::AllocOp)

Alloc op.

Tensor Alloc operation

Attributes:

Attribute | MLIR Type | Description
address | ::mlir::IntegerAttr | 64-bit signless integer attribute
size | ::mlir::IntegerAttr | 64-bit signless integer attribute
memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace

Results:

Result | Description
result | ranked tensor of any type values

ttir.arange (tt::ttir::ArangeOp)

Tensor range generation operation.

The arange operation generates a tensor with evenly spaced values within a given interval.

This operation creates a tensor with values from start to end (exclusive) with a step size of step, along the dimension specified by arange_dimension. It's similar to NumPy's arange function and is useful for creating tensors with regular sequences of values.

Example:

// Generate a 1D tensor with values [0, 1, 2, 3, 4]
%result = ttir.arange() {
    start = 0 : si64,
    end = 5 : si64,
    step = 1 : si64,
    arange_dimension = 0 : i64
} : () -> tensor<5xi64>

// Generate a 1D tensor with values [0.0, 2.0, 4.0, 6.0, 8.0]
%result = ttir.arange() {
    start = 0 : si64,
    end = 10 : si64,
    step = 2 : si64,
    arange_dimension = 0 : i64
} : () -> tensor<5xf32>

// Generate a 2D tensor with the sequence along dimension 0
%result = ttir.arange() {
    start = 0 : si64,
    end = 5 : si64,
    step = 1 : si64,
    arange_dimension = 0 : i64
} : () -> tensor<5x3xi64>
// Result:
// [[0, 0, 0],
//  [1, 1, 1],
//  [2, 2, 2],
//  [3, 3, 3],
//  [4, 4, 4]]

// Generate a 2D tensor with the sequence along dimension 1
%result = ttir.arange() {
    start = 0 : si64,
    end = 3 : si64,
    step = 1 : si64,
    arange_dimension = 1 : i64
} : () -> tensor<5x3xi64>
// Result:
// [[0, 1, 2],
//  [0, 1, 2],
//  [0, 1, 2],
//  [0, 1, 2],
//  [0, 1, 2]]

Attributes:

  • start (Integer): The start value of the sequence.
  • end (Integer): The end value of the sequence (exclusive).
  • step (Integer): The step size between values in the sequence.
  • arange_dimension (Integer): The dimension along which to generate the sequence.

Outputs:

  • result (Tensor): The generated tensor containing the sequence.

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
start | ::mlir::IntegerAttr | 64-bit signed integer attribute
end | ::mlir::IntegerAttr | 64-bit signed integer attribute
step | ::mlir::IntegerAttr | 64-bit signed integer attribute
arange_dimension | ::mlir::IntegerAttr | 64-bit signless integer attribute

Results:

Result | Description
result | ranked tensor of any type values

ttir.argmax (tt::ttir::ArgMaxOp)

Argmax reduction op.

Determine the indices of the maximum values along a specified dimension of a tensor or over all elements in a tensor.

This operation reduces the input tensor by finding the index of the maximum value along the dimensions specified in dim_arg. If dim_arg is not provided, the argmax is computed over all dimensions, resulting in a scalar index. If keep_dim is set to true, the reduced dimensions are retained with a size of 1.

Example IR Usage:

// Argmax along dimension 1
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<2xi32>
%result = ttir.argmax(%input, %output) {keep_dim = false, dim_arg = [1: i32]} : tensor<2x3xf32>, tensor<2xi32> -> tensor<2xi32>
// Input tensor:
// [[1.0, 5.0, 3.0],
//  [2.0, 4.0, 6.0]]
// Output tensor:
// [1, 2]  // Index of maximum value in each row (5.0 in first row, 6.0 in second row)

// Argmax along dimension 0
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<3xi32>
%result = ttir.argmax(%input, %output) {keep_dim = false, dim_arg = [0: i32]} : tensor<2x3xf32>, tensor<3xi32> -> tensor<3xi32>
// Input tensor:
// [[1.0, 5.0, 3.0],
//  [2.0, 4.0, 6.0]]
// Output tensor:
// [1, 0, 1]  // Index of maximum value in each column

// Argmax over all dimensions
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<i32>
%result = ttir.argmax(%input, %output) {keep_dim = false} : tensor<2x3xf32>, tensor<i32> -> tensor<i32>
// Input tensor:
// [[1.0, 5.0, 3.0],
//  [2.0, 4.0, 6.0]]
// Output tensor:
// 5  // Flattened index of the maximum value (6.0)

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • keep_dim (Bool): Whether to keep the reduced dimensions or not.
  • dim_arg (Array of Int32): Dimensions to reduce along.

Outputs:

  • output (Tensor): The result tensor after applying the reduction.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
keep_dim | ::mlir::BoolAttr | bool attribute
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.atan2 (tt::ttir::Atan2Op)

Elementwise atan2 operation.

The atan2 operation performs an elementwise arc tangent (inverse tangent) operation between two tensors.

For each pair of corresponding elements, it computes the angle in radians between the positive x-axis and the vector from the origin to the point (x, y) in the Cartesian plane. This operation is typically used in trigonometric calculations and supports partial broadcasting, allowing operands of different shapes to be combined.

Example:

// %lhs: [0.0, 1.0, -1.0]
// %rhs: [1.0, 0.0, 0.0]
%result = ttir.atan2(%lhs, %rhs, %output) : tensor<3xf64>, tensor<3xf64>, tensor<3xf64> -> tensor<3xf64>
// %result: [0.0, 1.57079637, -1.57079637] // [0.0, pi/2, -pi/2]

Mathematical definition: atan2(x, y) = arctan(y / x)

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
lhs | ranked tensor of any type values
rhs | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.atan (tt::ttir::AtanOp)

Eltwise arctangent op.

The atan operation computes the arctangent (inverse tangent) of each element in the input tensor.

For each element, it returns the angle in radians whose tangent is the input value. The operation returns values in the range [-π/2, π/2].

Example:

// Compute arctangent of all elements in %input
%result = ttir.atan(%input, %output) : tensor<4xf32>, tensor<4xf32> -> tensor<4xf32>
// Input tensor:
// [1.0, 0.5, 0.0, -1.0]
// Output tensor:
// [0.785, 0.464, 0.0, -0.785]  // values in radians

// Example with different values
%result = ttir.atan(%float_input, %float_output) : tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensor:
// [0.0, 1.0, 1000.0]
// Output tensor:
// [0.0, 0.785, 1.571]  // values approach π/2 as input grows

Mathematical definition: atan(x) = tan⁻¹(x), where the result is in the range [-π/2, π/2]

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.avg_pool2d (tt::ttir::AvgPool2dOp)

2D average pooling operation.

The avg_pool2d operation applies a 2D average pooling over an input tensor composed of several input planes.

This operation performs downsampling by dividing the input into local regions and computing the average value of each region. It reduces the spatial dimensions (height and width) of an input tensor while preserving the batch and channel dimensions. This is commonly used in neural networks to reduce the spatial size of feature maps.

Example:

// Basic 2D average pooling with a 2x2 kernel and stride 1
%input = ... : tensor<1x3x3x1xf32>  // 3x3 input tensor with values:
                                    // [[[1, 2, 3],
                                    //   [4, 5, 6],
                                    //   [7, 8, 9]]]]
%output = ttir.empty() : tensor<1x2x2x1xf32>
%result = ttir.avg_pool2d(%input, %output) {
    kernel_height = 2 : i32,
    kernel_width = 2 : i32,
    stride_height = 1 : i32,
    stride_width = 1 : i32,
    dilation_height = 1 : i32,
    dilation_width = 1 : i32,
    ceil_mode = false,
    padding_left = 0 : i32,
    padding_right = 0 : i32,
    padding_top = 0 : i32,
    padding_bottom = 0 : i32
} : tensor<1x3x3x1xf32>, tensor<1x2x2x1xf32> -> tensor<1x2x2x1xf32>
// Result: [[[3, 4],
//           [6, 7]]]]
// Where: 3 = (1+2+4+5)/4, 4 = (2+3+5+6)/4, 6 = (4+5+7+8)/4, 7 = (5+6+8+9)/4

Inputs:

  • input (Tensor): Input tensor in NHWC format (batch, height, width, channels).

Attributes:

  • kernel_height (Integer): Height of the pooling kernel.
  • kernel_width (Integer): Width of the pooling kernel.
  • stride_height (Integer): Stride along the height dimension.
  • stride_width (Integer): Stride along the width dimension.
  • dilation_height (Integer): Dilation factor for height dimension.
  • dilation_width (Integer): Dilation factor for width dimension.
  • ceil_mode (Boolean): When true, uses ceil instead of floor for output shape calculation.
  • padding_left, padding_right, padding_top, padding_bottom (Integer): Padding on each side.

Outputs:

  • result (Tensor): Output tensor after average pooling.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `kernel_height` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `kernel_width` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `stride_height` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `stride_width` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `dilation_height` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `dilation_width` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `ceil_mode` | ::mlir::BoolAttr | bool attribute |
| `padding_left` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `padding_right` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `padding_top` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `padding_bottom` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `flattened_compat_info` | ::mlir::tt::ttir::FlattenedCompatInfoAttr | Information for sliding window operations with tensors flattened to (1, 1, N*H*W, C). This attribute marks operations that are compatible with flattened tensors; it is used as a marker and doesn't carry any additional data. |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.batch_norm (tt::ttir::BatchNormOp)

BatchNormInference operation

Performs batch normalization on the input tensor. Normalizes the operand tensor across all dimensions except for the specified dimension (feature dimension) and produces the normalized result.

Inputs:

  • operand (Tensor): The input tensor to be normalized.
  • scale (Tensor): The scale parameter (gamma).
  • offset (Tensor): The offset parameter (beta).
  • mean (Tensor): The pre-computed mean of the input.
  • variance (Tensor): The pre-computed variance of the input.

Attributes:

  • epsilon (Float): A small constant added to variance for numerical stability.
  • dimension (Integer): The dimension that represents the features/channels.
  • training (Bool): Whether the operation is in training mode.

Output:

  • result (Tensor): The normalized output tensor.

Example:

  // Normalize a batch of activations
  %result = ttir.batch_norm(%operand, %scale, %offset, %mean, %variance, %output,
                          epsilon = 0.001, dimension = 1, training = false) :
        (tensor<8x16x32x32xf32>, tensor<16xf32>, tensor<16xf32>,
          tensor<16xf32>, tensor<16xf32>, tensor<8x16x32x32xf32>) -> tensor<8x16x32x32xf32>

Mathematical definition: batch_norm(x, scale, offset, mean, variance, epsilon, dimension) = (x - mean) / sqrt(variance + epsilon) * scale + offset
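
As an illustration of the formula above, here is a minimal NumPy sketch of the inference-mode computation, assuming `dimension` indexes the feature/channel axis; `batch_norm_inference` is a hypothetical helper, not part of the compiler:

```python
import numpy as np

def batch_norm_inference(x, scale, offset, mean, variance, epsilon, dimension):
    # Reshape the per-channel parameters so they broadcast along `dimension`.
    shape = [1] * x.ndim
    shape[dimension] = x.shape[dimension]
    scale, offset = scale.reshape(shape), offset.reshape(shape)
    mean, variance = mean.reshape(shape), variance.reshape(shape)
    return (x - mean) / np.sqrt(variance + epsilon) * scale + offset

x = np.random.randn(8, 16, 32, 32).astype(np.float32)
scale = np.ones(16, np.float32)
offset = np.zeros(16, np.float32)
mean = np.zeros(16, np.float32)
variance = np.ones(16, np.float32)
out = batch_norm_inference(x, scale, offset, mean, variance, epsilon=1e-3, dimension=1)
print(out.shape)  # (8, 16, 32, 32)
```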

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `epsilon` | ::mlir::FloatAttr | 32-bit float attribute |
| `dimension` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `training` | ::mlir::BoolAttr | bool attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `operand` | ranked tensor of any type values |
| `scale` | ranked tensor of any type values |
| `offset` | ranked tensor of any type values |
| `mean` | ranked tensor of any type values |
| `variance` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.bitwise_and (tt::ttir::BitwiseAndOp)

Elementwise bitwise AND.

The bitwise_and operation performs an elementwise bitwise AND operation between two tensors.

For each pair of corresponding elements, it computes the bitwise AND of their binary representations. This operation is typically used with integer data types and has the idempotence property, meaning that applying it twice with the same second operand returns the original result: bitwise_and(bitwise_and(x, y), y) = bitwise_and(x, y).

Example:

// Bitwise AND operation
%result = ttir.bitwise_and(%lhs, %rhs, %output) : tensor<2x2xi32>, tensor<2x2xi32>, tensor<2x2xi32> -> tensor<2x2xi32>
// Input tensors:
// %lhs: [[1, 2], [3, 4]]
// %rhs: [[5, 6], [7, 8]]
// Output tensor:
// [[1, 2], [3, 0]]

// Example with binary representation (for 8-bit integers)
%result = ttir.bitwise_and(%int8_lhs, %int8_rhs, %int8_output) : tensor<4xi8>, tensor<4xi8>, tensor<4xi8> -> tensor<4xi8>
// Input tensors:
// %int8_lhs: [0x0F, 0xAA, 0xFF, 0x00]  (binary: [00001111, 10101010, 11111111, 00000000])
// %int8_rhs: [0xF0, 0x55, 0xFF, 0x00]  (binary: [11110000, 01010101, 11111111, 00000000])
// Output tensor:
// [0x00, 0x00, 0xFF, 0x00]  (binary: [00000000, 00000000, 11111111, 00000000])

Mathematical definition: bitwise_and(x, y) = x & y

Traits: AlwaysSpeculatableImplTrait, TTIR_BinaryIdempotence, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `lhs` | ranked tensor of any type values |
| `rhs` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.bitwise_not (tt::ttir::BitwiseNotOp)

Elementwise bitwise NOT.

The bitwise_not operation computes the bitwise NOT (one's complement) of each element in the input tensor.

For each element, it flips all the bits in the binary representation of the value. This operation is typically used with integer data types and has the involution property, meaning that applying it twice returns the original value: bitwise_not(bitwise_not(x)) = x.

Example:

// Bitwise NOT operation with integer tensors
%result = "ttir.bitwise_not"(%operand, %result) : (tensor<2x2xi32>, tensor<2x2xi32>) -> tensor<2x2xi32>
// %operand: [[1, 2], [3, 4]]
// %result: [[-2, -3], [-4, -5]]

// Example with binary representation (for 8-bit integers)
%result = ttir.bitwise_not(%int8_input, %int8_output) : tensor<3xi8>, tensor<3xi8> -> tensor<3xi8>
// Input %int8_input:
// [0, 5, 255]  (binary: [00000000, 00000101, 11111111])
// Output %int8_output:
// [255, 250, 0]  (binary: [11111111, 11111010, 00000000])

Mathematical definition: bitwise_not(x) = ~x

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TTIR_Involution, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.bitwise_or (tt::ttir::BitwiseOrOp)

Elementwise bitwise OR operation.

The bitwise_or operation performs an elementwise bitwise OR operation between two tensors.

For each pair of corresponding elements, it computes the bitwise OR of their binary representations. This operation is typically used with integer data types and has the idempotence property, meaning that applying it twice with the same second operand returns the original result: bitwise_or(bitwise_or(x, y), y) = bitwise_or(x, y).

Example:

// Bitwise OR operation
%result = ttir.bitwise_or(%lhs, %rhs, %output) : tensor<2x2xi32>, tensor<2x2xi32>, tensor<2x2xi32> -> tensor<2x2xi32>
// Input tensors:
// %lhs: [[1, 2], [3, 4]]
// %rhs: [[5, 6], [7, 8]]
// Output tensor:
// [[5, 6], [7, 12]]

// Example with binary representation (for 8-bit integers)
%result = ttir.bitwise_or(%int8_lhs, %int8_rhs, %int8_output) : tensor<4xi8>, tensor<4xi8>, tensor<4xi8> -> tensor<4xi8>
// Input tensors:
// %int8_lhs: [0x0F, 0xAA, 0x00, 0x55]  (binary: [00001111, 10101010, 00000000, 01010101])
// %int8_rhs: [0xF0, 0x55, 0x00, 0xAA]  (binary: [11110000, 01010101, 00000000, 10101010])
// Output tensor:
// [0xFF, 0xFF, 0x00, 0xFF]  (binary: [11111111, 11111111, 00000000, 11111111])

Mathematical definition: bitwise_or(x, y) = x | y

Traits: AlwaysSpeculatableImplTrait, TTIR_BinaryIdempotence, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `lhs` | ranked tensor of any type values |
| `rhs` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.bitwise_xor (tt::ttir::BitwiseXorOp)

Elementwise bitwise XOR operation.

The bitwise_xor operation performs an elementwise bitwise XOR (exclusive OR) operation between two tensors.

For each pair of corresponding elements, it computes the bitwise XOR of their binary representations. This operation is typically used with integer data types and has the property that when applied twice with the same second operand, it returns the original input: bitwise_xor(bitwise_xor(x, y), y) = x.

Example:

// Bitwise XOR operation
%result = ttir.bitwise_xor(%lhs, %rhs, %output) : tensor<2x2xi32>, tensor<2x2xi32>, tensor<2x2xi32> -> tensor<2x2xi32>
// Input tensors:
// %lhs: [[1, 2], [3, 4]]
// %rhs: [[5, 6], [7, 8]]
// Output tensor:
// [[4, 4], [4, 12]]

// Example with binary representation (for 8-bit integers)
%result = ttir.bitwise_xor(%int8_lhs, %int8_rhs, %int8_output) : tensor<4xi8>, tensor<4xi8>, tensor<4xi8> -> tensor<4xi8>
// Input tensors:
// %int8_lhs: [0x0F, 0xAA, 0xFF, 0x00]  (binary: [00001111, 10101010, 11111111, 00000000])
// %int8_rhs: [0xF0, 0x55, 0xFF, 0x00]  (binary: [11110000, 01010101, 11111111, 00000000])
// Output tensor:
// [0xFF, 0xFF, 0x00, 0x00]  (binary: [11111111, 11111111, 00000000, 00000000])

Mathematical definition: bitwise_xor(x, y) = x ^ y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `lhs` | ranked tensor of any type values |
| `rhs` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.broadcast (tt::ttir::BroadcastOp)

Broadcast operation.

The broadcast operation expands the dimensions of an input tensor according to specified broadcast dimensions.

This operation takes an input tensor and broadcasts it to a larger shape by repeating elements along dimensions where the input has size 1 and the output has a larger size. This is commonly used to make tensors compatible for elementwise operations.

Example:

// Broadcast a tensor from shape [1, 1, 32] to [1, 16, 32]
%input = ... : tensor<1x1x32xf32>
%output = ttir.empty() : tensor<1x16x32xf32>
%result = ttir.broadcast(%input, %output) {broadcast_dimensions = [1, 16, 1]} :
    tensor<1x1x32xf32>, tensor<1x16x32xf32> -> tensor<1x16x32xf32>
// The input tensor is repeated 16 times along the second dimension

// Broadcast a tensor from shape [1, 3] to [2, 3]
%input = ... : tensor<1x3xf32>
%output = ttir.empty() : tensor<2x3xf32>
%result = ttir.broadcast(%input, %output) {broadcast_dimensions = [2, 1]} :
    tensor<1x3xf32>, tensor<2x3xf32> -> tensor<2x3xf32>
// The input tensor is repeated 2 times along the first dimension

Note: Currently, when generating a TTNN executable, the broadcast and repeat operations share the same semantics due to the lack of tensor view support in TTNN. As a result, the broadcast operation is lowered to a repeat operation in the TTNN compilation pipeline.

Inputs:

  • input (Tensor): The input tensor to broadcast.

Attributes:

  • broadcast_dimensions (Array of Integer): The number of times to broadcast the tensor along each dimension.

Outputs:

  • result (Tensor): The broadcasted tensor.
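
In NumPy terms, `broadcast_dimensions` behaves like the repetition argument of `np.tile`; the following is only a rough sketch of the semantics, not the lowering used by the compiler:

```python
import numpy as np

# broadcast_dimensions holds the repeat factor for each dimension, so expanding
# a tensor<1x1x32> to tensor<1x16x32> uses [1, 16, 1] -- equivalent to np.tile here.
x = np.arange(32, dtype=np.float32).reshape(1, 1, 32)
broadcast_dimensions = (1, 16, 1)
result = np.tile(x, broadcast_dimensions)
print(result.shape)  # (1, 16, 32)
```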

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `broadcast_dimensions` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.cbrt (tt::ttir::CbrtOp)

Elementwise cubic root operation.

The cbrt operation computes the cubic root (∛) of each element in the input tensor.

For each element, it returns the real-valued number that, when cubed, equals the input value. Unlike square root, cubic root is defined for negative numbers as well as positive numbers.

Example:

// Compute cubic root of all elements in %input
%result = ttir.cbrt(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[8.0, 27.0, -8.0, 1.0], ... ]
// Output tensor:
// [[2.0, 3.0, -2.0, 1.0], ... ]

// Example with different values
%result = ttir.cbrt(%float_input, %float_output) : tensor<3x2xf32>, tensor<3x2xf32> -> tensor<3x2xf32>
// Input tensor:
// [[125.0, -27.0],
//  [0.0, 0.001],
//  [1000.0, -1.0]]
// Output tensor:
// [[5.0, -3.0],
//  [0.0, 0.1],
//  [10.0, -1.0]]

Mathematical definition: cbrt(x) = ∛x = x^(1/3)

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.ceil (tt::ttir::CeilOp)

Elementwise ceiling operation.

The ceil operation computes the ceiling (smallest integer greater than or equal to x) of each element in the input tensor.

For each element, it rounds the value up to the nearest integer. The operation preserves the data type of the input.

This operation has the idempotence property, meaning that applying it multiple times produces the same result as applying it once: ceil(ceil(x)) = ceil(x).

Example:

// Compute ceiling of all elements in %input
%result = ttir.ceil(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[2.0, 2.0, 0.0, 5.0], ... ]

// Example with different values
%result = ttir.ceil(%float_input, %float_output) : tensor<3x2xf32>, tensor<3x2xf32> -> tensor<3x2xf32>
// Input tensor:
// [[3.14, -2.5],
//  [0.0, 0.001],
//  [9.999, -0.0]]
// Output tensor:
// [[4.0, -2.0],
//  [0.0, 1.0],
//  [10.0, 0.0]]

Mathematical definition: ceil(x) = ⌈x⌉ = min{n ∈ ℤ | n ≥ x}

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TTIR_Idempotence, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.clamp_scalar (tt::ttir::ClampScalarOp)

Scalar value clamping operation.

The clamp_scalar operation constrains all elements of a tensor to be within a specified range.

This operation applies element-wise clamping to the input tensor, ensuring that all values fall within the range [min, max]. Values less than min are set to min, and values greater than max are set to max. This is commonly used to ensure that tensor values stay within a valid range.

Example:

// Clamp values to the range [2.0, 5.0]
%input = ... : tensor<1x8xf32>  // Input tensor with values:
                                // [[0, 1, 2, 3, 4, 5, 6, 7]]
%output = ttir.empty() : tensor<1x8xf32>  // Output tensor shape
%result = ttir.clamp_scalar(%input, %output) {
    min = 2.0 : f32,  // Minimum value
    max = 5.0 : f32   // Maximum value
} : tensor<1x8xf32>, tensor<1x8xf32> -> tensor<1x8xf32>
// Result: [[2, 2, 2, 3, 4, 5, 5, 5]]
// Values < 2.0 are clamped to 2.0, values > 5.0 are clamped to 5.0

Inputs:

  • input (Tensor): The input tensor to clamp.

Attributes:

  • min (Float): The minimum value for clamping.
  • max (Float): The maximum value for clamping.

Outputs:

  • result (Tensor): The clamped tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `min` | ::mlir::FloatAttr | 32-bit float attribute |
| `max` | ::mlir::FloatAttr | 32-bit float attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.clamp_tensor (tt::ttir::ClampTensorOp)

Tensor value clamping operation.

The clamp_tensor operation constrains elements of a tensor to be within ranges specified by min and max tensors.

Unlike clamp_scalar, which uses scalar values for min and max, this operation uses tensor values for element-wise clamping. Each element in the input tensor is clamped between the corresponding elements in the min and max tensors. This allows for different clamping ranges for different elements.

Example:

// Clamp values using min and max tensors
%input = ... : tensor<1x8xf32>  // Input tensor with values:
                                // [[0, 1, 2, 3, 4, 5, 6, 7]]
%min = ... : tensor<1x8xf32>    // Min tensor with values:
                                // [[2, 2, 2, 3, 3, 3, 0, 0]]
%max = ... : tensor<1x8xf32>    // Max tensor with values:
                                // [[5, 5, 5, 9, 9, 9, 6, 6]]
%output = ttir.empty() : tensor<1x8xf32>  // Output tensor shape
%result = ttir.clamp_tensor(%input, %min, %max, %output) :
    tensor<1x8xf32>, tensor<1x8xf32>, tensor<1x8xf32>, tensor<1x8xf32> -> tensor<1x8xf32>
// Result: [[2, 2, 2, 3, 4, 5, 6, 6]]
// Each element is clamped between its corresponding min and max values

Inputs:

  • input (Tensor): The input tensor to clamp.
  • min (Tensor): The tensor containing minimum values for clamping.
  • max (Tensor): The tensor containing maximum values for clamping.

Outputs:

  • result (Tensor): The clamped tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `min` | ranked tensor of any type values |
| `max` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.collective_permute (tt::ttir::CollectivePermuteOp)

Collective permute operation.

Collective permute op. This operation ingests a tensor sharded across multiple devices and shuffles the device shards according to source_target_pairs, a list of [source, destination] device pairs.

Example: For a 1x2 mesh, the following will take the device shard living in device 0 and move it to device 1. The device shard living in device 1 will move to device 0. %source_target_pairs: [[0, 1], [1, 0]]

If a device does not appear as a destination in any pair, its shard is filled with zeros. For example, with %source_target_pairs: [[0, 1]], the shard on device 0 will contain zeros.
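
A minimal sketch of these semantics, modelling each device shard as an array; `collective_permute` here is a hypothetical helper, not the runtime API:

```python
import numpy as np

def collective_permute(shards, source_target_pairs):
    """Reference semantics sketch: shards[d] is the shard on device d.
    Devices that are not a destination in any pair end up with zeros."""
    result = [np.zeros_like(s) for s in shards]
    for src, dst in source_target_pairs:
        result[dst] = shards[src]
    return result

# 1x2 mesh: swap the shards on devices 0 and 1.
shards = [np.full((2, 2), 0.0), np.full((2, 2), 1.0)]
swapped = collective_permute(shards, [[0, 1], [1, 0]])
# With only [[0, 1]], device 0 would be left with an all-zero shard.
```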

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `source_target_pairs` | ::mlir::DenseIntElementsAttr | 64-bit signless integer elements attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.concat (tt::ttir::ConcatOp)

Tensor concatenation operation.

The concat operation joins multiple tensors along a specified dimension.

This operation concatenates a list of tensors along the dimension specified by dim. All input tensors must have the same shape except for the dimension being concatenated, and the output tensor's shape will match the input tensors except for the concatenated dimension, which will be the sum of the input dimensions.

Example:

// Concatenate along dimension 0
%input1 = ... : tensor<2x3xf32>
%input2 = ... : tensor<3x3xf32>
%output = ttir.empty() : tensor<5x3xf32>
%result = ttir.concat(%input1, %input2, %output) {dim = 0 : i32} :
    tensor<2x3xf32>, tensor<3x3xf32>, tensor<5x3xf32> -> tensor<5x3xf32>
// Input1 shape: [2, 3]
// Input2 shape: [3, 3]
// Output shape: [5, 3]

// Concatenate along dimension 1
%input1 = ... : tensor<2x3xf32>
%input2 = ... : tensor<2x2xf32>
%output = ttir.empty() : tensor<2x5xf32>
%result = ttir.concat(%input1, %input2, %output) {dim = 1 : i32} :
    tensor<2x3xf32>, tensor<2x2xf32>, tensor<2x5xf32> -> tensor<2x5xf32>
// Input1 shape: [2, 3]
// Input2 shape: [2, 2]
// Output shape: [2, 5]

Inputs:

  • inputs (Variadic Tensor): A list of input tensors to concatenate.

Attributes:

  • dim (Integer): The dimension along which to concatenate the tensors.

Outputs:

  • result (Tensor): The concatenated tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `dim` | ::mlir::IntegerAttr | 32-bit signed integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `inputs` | variadic of ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.constant (tt::ttir::ConstantOp)

Tensor constant creation operation.

The constant operation creates a tensor with values specified by a constant attribute.

This operation is used to create tensors with predefined values that remain constant throughout program execution. It's commonly used for initializing model weights, biases, and other fixed parameters in neural networks.

Example:

// Create a 2D tensor of zeros
%result = ttir.constant() {
    value = dense<0> : tensor<2x3xi32>
} : () -> tensor<2x3xi32>
// Result: [[0, 0, 0], [0, 0, 0]]

// Create a 1D tensor with specific floating-point values
%result = ttir.constant() {
    value = dense<[0.2, 1.3]> : tensor<2xf32>
} : () -> tensor<2xf32>
// Result: [0.2, 1.3]

// Create a scalar constant
%result = ttir.constant() {
    value = dense<5.0> : tensor<f32>
} : () -> tensor<f32>
// Result: 5.0

// Create a 2D tensor with different values
%result = ttir.constant() {
    value = dense<[[1, 2, 3], [4, 5, 6]]> : tensor<2x3xi32>
} : () -> tensor<2x3xi32>
// Result: [[1, 2, 3], [4, 5, 6]]

Attributes:

  • value (DenseElementsAttr): The constant value of the tensor.

Outputs:

  • result (Tensor): The tensor with the specified constant values.

Note: The shape and element type of the result tensor are determined by the value attribute. The constant operation is typically folded during compilation, allowing for optimizations such as constant propagation.

Traits: AlwaysSpeculatableImplTrait, ConstantLike, TT_CreationOpTrait

Interfaces: BufferizableOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `value` | ::mlir::ElementsAttr | constant vector/tensor attribute |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.conv2d (tt::ttir::Conv2dOp)

Conv2d operation.

Applies a 2D convolution over an input image composed of several input planes.

This operation performs a 2D convolution on the input tensor using the provided weight tensor and optional bias. It supports configurable stride, padding, dilation, and grouping parameters to control the convolution behavior.

Example:

// Basic 2D convolution
%input = ... : tensor<1x28x28x3xf32>    // Batch size 1, 28x28 image, 3 channels
%weight = ... : tensor<16x3x3x3xf32>    // 16 output channels, 3 input channels, 3x3 kernel
%bias = ... : tensor<1x1x1x16xf32>      // Bias for 16 output channels
%output = ttir.empty() : tensor<1x26x26x16xf32>  // Output shape with no padding
%result = ttir.conv2d(%input, %weight, %bias, %output) {
    stride = [1, 1],
    padding = [0, 0, 0, 0],
    dilation = [1, 1],
    groups = 1
} : tensor<1x28x28x3xf32>, tensor<16x3x3x3xf32>, tensor<1x1x1x16xf32>, tensor<1x26x26x16xf32> -> tensor<1x26x26x16xf32>

// Convolution with stride 2 and padding
%input = ... : tensor<1x28x28x3xf32>    // Batch size 1, 28x28 image, 3 channels
%weight = ... : tensor<16x3x3x3xf32>    // 16 output channels, 3 input channels, 3x3 kernel
%bias = ... : tensor<1x1x1x16xf32>      // Bias for 16 output channels
%output = ttir.empty() : tensor<1x14x14x16xf32>  // Output shape with stride 2
%result = ttir.conv2d(%input, %weight, %bias, %output) {
    stride = [2, 2],
    padding = [1, 1, 1, 1],
    dilation = [1, 1],
    groups = 1
} : tensor<1x28x28x3xf32>, tensor<16x3x3x3xf32>, tensor<1x1x1x16xf32>, tensor<1x14x14x16xf32> -> tensor<1x14x14x16xf32>

Inputs:

  • input (AnyRankedTensor): expected in the following format (N, H_in, W_in, C) where:
    • N is the batch size
    • H_in is the height of the input planes
    • W_in is the width of the input planes
    • C is the number of channels
  • weight (AnyRankedTensor): expected in the following format (O, C/G, K_H, K_W) where:
    • C is the number of input channels
    • O is the number of output channels
    • G is the number of groups
    • K_H is the height of the kernel
    • K_W is the width of the kernel
  • bias Optional: expected in the following format (1, 1, 1, O).

Attributes:

  • stride (i32 | array<2xi32>):
    • i32: Same stride for height and width dimensions (sH = sW = value).
    • array<2xi32>: [sH, sW] where sH is stride for height and sW is stride for width.
  • padding (i32 | array<2xi32> | array<4xi32>):
    • i32: Same padding for all sides (pT = pL = pB = pR = value).
    • array<2xi32>: [pH, pW] where pH is padding for height (top/bottom) and pW is padding for width (left/right).
    • array<4xi32>: [pT, pL, pB, pR] for top, left, bottom, and right padding respectively.
  • dilation (i32 | array<2xi32>): Spacing between kernel elements.
    • i32: Same dilation for height and width dimensions (dH = dW = value).
    • array<2xi32>: [dH, dW] where dH is dilation for height and dW is dilation for width.
  • groups (i32): Number of blocked connections from input channels to output channels. Input and output channels must both be divisible by groups.

Outputs:

  • result AnyRankedTensor: expected in the following format (N, H_out, W_out, O) where:
    • H_out = (H_in + pT + pB - dH * (K_H - 1) - 1) / sH + 1
    • W_out = (W_in + pL + pR - dW * (K_W - 1) - 1) / sW + 1
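
The output-extent formulas above can be written as a small helper; this is an illustrative sketch using floor division, and `conv2d_extent` is a hypothetical name, not part of the compiler:

```python
def conv2d_extent(size, kernel, stride, pad_lo, pad_hi, dilation):
    # H_out = (H_in + pT + pB - dH * (K_H - 1) - 1) / sH + 1, using floor division.
    return (size + pad_lo + pad_hi - dilation * (kernel - 1) - 1) // stride + 1

# 28x28 input, 3x3 kernel, stride 2, padding 1 on every side -> 14x14 (second example above).
print(conv2d_extent(28, 3, 2, 1, 1, 1))  # 14
```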

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `stride` | ::mlir::Attribute | 32-bit signless integer attribute or i32 dense array attribute |
| `padding` | ::mlir::Attribute | 32-bit signless integer attribute or i32 dense array attribute |
| `dilation` | ::mlir::Attribute | 32-bit signless integer attribute or i32 dense array attribute |
| `groups` | ::mlir::IntegerAttr | 32-bit signless integer attribute |
| `flattened_compat_info` | ::mlir::tt::ttir::FlattenedCompatInfoAttr | Information for sliding window operations with tensors flattened to (1, 1, N*H*W, C). This attribute marks operations that are compatible with flattened tensors; it is used as a marker and doesn't carry any additional data. |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `weight` | ranked tensor of any type values |
| `bias` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.conv_transpose2d (tt::ttir::ConvTranspose2dOp)

ConvTranspose2d operation.

Applies a 2D transposed convolution operator over an input image composed of several input planes.

This operation performs the gradient of a 2D convolution with respect to the input, which is useful for tasks like upsampling feature maps in neural networks. It supports configurable stride, padding, dilation, output padding, and grouping parameters.

Example:

// Basic 2D transposed convolution
%input = ... : tensor<1x14x14x16xf32>   // Batch size 1, 14x14 feature map, 16 channels
%weight = ... : tensor<16x8x3x3xf32>    // 16 input channels, 8 output channels, 3x3 kernel
%bias = ... : tensor<1x1x1x8xf32>       // Bias for 8 output channels
%output = ttir.empty() : tensor<1x28x28x8xf32>  // Output shape with stride 2
%result = ttir.conv_transpose2d(%input, %weight, %bias, %output) {
    stride = [2, 2],
    padding = [0, 0, 0, 0],
    dilation = [1, 1],
    output_padding = [0, 0],
    groups = 1
} : tensor<1x14x14x16xf32>, tensor<16x8x3x3xf32>, tensor<1x1x1x8xf32>, tensor<1x28x28x8xf32> -> tensor<1x28x28x8xf32>

// Transposed convolution with padding and output padding
%input = ... : tensor<1x14x14x16xf32>   // Batch size 1, 14x14 feature map, 16 channels
%weight = ... : tensor<16x8x4x4xf32>    // 16 input channels, 8 output channels, 4x4 kernel
%bias = ... : tensor<1x1x1x8xf32>       // Bias for 8 output channels
%output = ttir.empty() : tensor<1x29x29x8xf32>  // Output shape with output padding
%result = ttir.conv_transpose2d(%input, %weight, %bias, %output) {
    stride = [2, 2],
    padding = [1, 1, 1, 1],
    dilation = [1, 1],
    output_padding = [1, 1],
    groups = 1
} : tensor<1x14x14x16xf32>, tensor<16x8x4x4xf32>, tensor<1x1x1x8xf32>, tensor<1x29x29x8xf32> -> tensor<1x29x29x8xf32>

Inputs:

  • input AnyRankedTensor: expected in the following format (N, H_in, W_in, C) where:
    • N is the batch size
    • H_in is the height of the input planes
    • W_in is the width of the input planes
    • C is the number of channels
  • weight (AnyRankedTensor): expected in the following format (C, O/G, K_H, K_W) where:
    • C is the number of input channels
    • O is the number of output channels
    • G is the number of groups
    • K_H is the height of the kernel
    • K_W is the width of the kernel
  • bias Optional: expected in the following format (1, 1, 1, O).

Attributes:

  • stride (i32 | array<2xi32>): Controls the stride for the cross-correlation.
  • padding (i32 | array<2xi32> | array<4xi32>): Controls the amount of implicit zero padding on both sides for dilation * (kernel_size - 1) - padding number of points.
  • output_padding (i32 | array<2xi32>): Controls the additional size added to one side of the output shape.
  • dilation (i32 | array<2xi32>): Controls the spacing between the kernel points
  • groups (i32): Controls the connections between inputs and outputs. Input and output channels must both be divisible by groups.

Outputs:

  • result AnyRankedTensor: expected in the following format (N, H_out, W_out, O) where:
    • H_out = (H_in - 1) * stride[0] - (padding_top + padding_bottom) + dilation[0] * (K_H - 1) + output_padding[0] + 1
    • W_out = (W_in - 1) * stride[1] - (padding_left + padding_right) + dilation[1] * (K_W - 1) + output_padding[1] + 1
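
Likewise, the transposed-convolution output formulas above as a small sketch; `conv_transpose2d_extent` is a hypothetical helper used only for illustration:

```python
def conv_transpose2d_extent(size, kernel, stride, pad_lo, pad_hi, dilation, output_padding):
    # H_out = (H_in - 1) * stride - (pad_top + pad_bottom) + dilation * (K_H - 1) + output_padding + 1
    return (size - 1) * stride - (pad_lo + pad_hi) + dilation * (kernel - 1) + output_padding + 1

# 14x14 input, 4x4 kernel, stride 2, padding 1 per side, output_padding 1 -> 29 (second example above).
print(conv_transpose2d_extent(14, 4, 2, 1, 1, 1, 1))  # 29
```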

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `stride` | ::mlir::Attribute | 32-bit signless integer attribute or i32 dense array attribute |
| `padding` | ::mlir::Attribute | 32-bit signless integer attribute or i32 dense array attribute |
| `output_padding` | ::mlir::Attribute | 32-bit signless integer attribute or i32 dense array attribute |
| `dilation` | ::mlir::Attribute | 32-bit signless integer attribute or i32 dense array attribute |
| `groups` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `weight` | ranked tensor of any type values |
| `bias` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.convolution (tt::ttir::ConvolutionOp)

Generalized convolution operation.

This operation is a more flexible form of convolution that can handle arbitrary dimensionality and supports various configuration options. It's designed to be a generalization of specific convolution operations like conv2d and conv_transpose2d.

Example:

// 2D convolution using the generalized convolution operation
%lhs = ... : tensor<1x32x32x3xf32>     // Input tensor: batch size 1, 32x32 image, 3 channels
%rhs = ... : tensor<5x5x3x16xf32>      // Filter tensor: 5x5 kernel, 3 input channels, 16 output channels
%output = ttir.empty() : tensor<1x28x28x16xf32>  // Output tensor
%result = ttir.convolution(%lhs, %rhs, %output) {
    window_strides = [1, 1],
    padding = [[0, 0], [0, 0]],
    lhs_dilation = [1, 1],
    rhs_dilation = [1, 1],
    window_reversal = [false, false],
    dimension_numbers = {
        input_batch_dimension = 0,
        input_feature_dimension = 3,
        input_spatial_dimensions = [1, 2],
        kernel_input_feature_dimension = 2,
        kernel_output_feature_dimension = 3,
        kernel_spatial_dimensions = [0, 1],
        output_batch_dimension = 0,
        output_feature_dimension = 3,
        output_spatial_dimensions = [1, 2]
    },
    feature_group_count = 1,
    batch_group_count = 1
} : tensor<1x32x32x3xf32>, tensor<5x5x3x16xf32>, tensor<1x28x28x16xf32> -> tensor<1x28x28x16xf32>

Inputs:

  • input - The input tensor.
  • weight - The filter/kernel tensor.
  • bias - The bias tensor.

Attributes:

  • window_strides (Array): Stride of the sliding window for each spatial dimension.
  • padding (Array): Padding applied to the input in each spatial dimension.
  • input_dilation (Array): Dilation factor for the input in each spatial dimension.
  • weight_dilation (Array): Dilation factor for the filter in each spatial dimension.
  • window_reversal (Array): Whether to reverse the window in each spatial dimension.
  • convolution_layout (Struct): Specifies the dimension numbering in the inputs and outputs.
  • feature_group_count (Integer): Number of feature groups for grouped convolution.
  • batch_group_count (Integer): Number of batch groups for grouped convolution.

Outputs:

  • result (Tensor): Output tensor containing the result of the convolution.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `window_strides` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| `padding` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| `input_dilation` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| `weight_dilation` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| `window_reversal` | ::mlir::DenseBoolArrayAttr | i1 dense array attribute |
| `convolution_layout` | ::mlir::tt::ttir::ConvolutionLayoutAttr | Structure of dimension information for the convolution op. Holds the layout information for the input activation, weights, and output. |
| `feature_group_count` | ::mlir::IntegerAttr | 64-bit signless integer attribute whose value is positive |
| `batch_group_count` | ::mlir::IntegerAttr | 64-bit signless integer attribute whose value is positive |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `weight` | ranked tensor of any type values |
| `bias` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| «unnamed» | ranked tensor of any type values |

ttir.cos (tt::ttir::CosOp)

Elementwise cosine operation.

The cos operation computes the cosine of each element in the input tensor.

For each element, it returns the cosine of the angle in radians.

Example:

// Compute cosine of all elements in %input
%result = ttir.cos(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[-0.1288, -0.4161, 0.9553, -0.2108], ... ]

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.cumsum (tt::ttir::CumSumOp)

Cumulative sum operation.

The cumsum operation computes the cumulative sum of elements along a specified dimension of the input tensor.

For each position in the output tensor, this operation computes the sum of all elements in the input tensor along the specified dimension up to and including that position. The shape of the output tensor matches the shape of the input tensor.

Example:

// Cumulative sum along dimension 0
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<2x3xf32>
%result = ttir.cumsum(%input, %output) {dim = 0 : i64} : tensor<2x3xf32>, tensor<2x3xf32> -> tensor<2x3xf32>
// Input tensor:
// [[1, 2, 3],
//  [4, 5, 6]]
// Output tensor:
// [[1, 2, 3],   // first row remains the same
//  [5, 7, 9]]   // each element is the sum of the corresponding column up to this point

// Cumulative sum along dimension 1
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<2x3xf32>
%result = ttir.cumsum(%input, %output) {dim = 1 : i64} : tensor<2x3xf32>, tensor<2x3xf32> -> tensor<2x3xf32>
// Input tensor:
// [[1, 2, 3],
//  [4, 5, 6]]
// Output tensor:
// [[1, 3, 6],   // each element is the sum of the corresponding row up to this point
//  [4, 9, 15]]

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • dim (Integer): The dimension along which to compute the cumulative sum.

Outputs:

  • result (Tensor): The tensor containing the cumulative sums.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `dim` | ::mlir::IntegerAttr | 64-bit signless integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.dealloc (tt::ttir::DeallocOp)

Dealloc op.

Tensor Dealloc operation

Operands:

| Operand | Description |
| ------- | ----------- |
| `result` | ranked tensor of any type values |

ttir.dequantize (tt::ttir::DequantizeOp)

Dequantize operation.

The Dequantize operation converts a quantized tensor back into a floating-point tensor using the quant.uniform type from the MLIR Quant dialect. The input tensor is expected to be of type quant.uniform. The output tensor will be a floating-point tensor, where each element is computed as:

output[i] = (input[i] - zero_point) * scale

Example:

%input = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>
%output = ttir.empty() : () -> tensor<64x128xf32>
%dequantized = "ttir.dequantize"(%input, %output) : (tensor<64x128x!quant.uniform<i32:f32, 0.1>>, tensor<64x128xf32>) -> tensor<64x128xf32>

// In this example:
// - The input is a 64x128 tensor of 32-bit quantized values
// - The output is a 64x128 tensor of 32-bit floating-point values
// - The scale is 0.1 (each step represents 0.1 in the original scale)
// - The zero point is 0 (no zero point is given in the type, so the default of 0 is used)

Inputs:

  • input (Quantized Tensor): The quantized tensor to be dequantized.

Results:

  • result (Tensor): The floating-point tensor after dequantization.

Note: The quantization parameters (scale and zero point) are specified in the input tensor type. Dequantization is the reverse process of quantization, converting quantized values back to floating-point values.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.dequantize_unrolled (tt::ttir::DequantizeUnrolledOp)

Dequantize operation unrolled (scale and zero point as input operands).

The DequantizeUnrolledOp dequantizes a tensor using the scale and zero point provided as input operands.

Inputs:

  • input (AnyRankedTensor): The input tensor to be dequantized. Must have a quantized element type.
  • scale (AnyRankedTensor): The scale factor (or factors for per-axis quantization).
  • zero_point (AnyRankedTensor): The zero point value (or values for per-axis quantization). Must be in range of the quantized storage type.
  • axis (Optional): The axis along which quantization is applied. Must be in range [0, rank) where rank is the rank of the input tensor.
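
A minimal sketch of the dequantization arithmetic with scale and zero point supplied as operands, assuming the usual (q - zero_point) * scale rule and per-axis broadcasting along `axis`; `dequantize_unrolled` here is a hypothetical NumPy helper, not the compiler's implementation:

```python
import numpy as np

def dequantize_unrolled(q, scale, zero_point, axis=None):
    """Sketch: (q - zero_point) * scale; for per-axis quantization, scale and
    zero_point hold one value per slice along `axis`."""
    if axis is not None:
        shape = [1] * q.ndim
        shape[axis] = -1
        scale = np.asarray(scale, dtype=np.float32).reshape(shape)
        zero_point = np.asarray(zero_point, dtype=np.float32).reshape(shape)
    return (q.astype(np.float32) - zero_point) * scale

q = np.array([[0, 128, 255], [64, 128, 192]], dtype=np.int32)
print(dequantize_unrolled(q, scale=0.1, zero_point=128))     # per-tensor
print(dequantize_unrolled(q, [0.1, 0.2], [128, 0], axis=0))  # per-axis along dim 0
```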

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `axis` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `scale` | ranked tensor of any type values |
| `zero_point` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.div (tt::ttir::DivOp)

Elementwise division operation.

The div operation performs an elementwise division between two tensors.

For each pair of corresponding elements, it divides the element in the first tensor (dividend) by the element in the second tensor (divisor) and places the result in the output tensor.

Example:

// Division operation
%result = ttir.div(%lhs, %rhs, %output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi32> -> tensor<3xi32>
// Input tensors:
// %lhs: [10, 20, 20]
// %rhs: [1, 2, 3]
// Output tensor:
// [10, 10, 6]

// Example with floating point values
%result = ttir.div(%float_lhs, %float_rhs, %float_output) : tensor<3xf32>, tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensors:
// %float_lhs: [3.5, 0.0, -1.2]
// %float_rhs: [1.5, 2.0, -3.2]
// Output tensor:
// [2.333333333, 0.0, -0.375]

Note: Division by zero typically results in undefined behavior or NaN for floating-point types.

Mathematical definition: div(x, y) = x / y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `lhs` | ranked tensor of any type values |
| `rhs` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.dot_general (tt::ttir::DotGeneralOp)

Dot general operation.

Flexible tensor operation that generalizes matrix multiplication by allowing the user to specify which dimensions of the two tensors to contract. Matrix multiplication is a special case of this operation, where the contraction happens along the last axis of the first tensor and the second-to-last axis of the second tensor. See the StableHLO DotGeneral op: https://openxla.org/stablehlo/spec#dot_general
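
A reference sketch of these semantics in NumPy, assuming the conventional StableHLO result layout of batch dimensions, then lhs free dimensions, then rhs free dimensions; `dot_general` here is a hypothetical helper, not the compiler's implementation:

```python
import numpy as np

def dot_general(lhs, rhs, batch_dims_lhs, contract_dims_lhs,
                batch_dims_rhs, contract_dims_rhs):
    # "Free" dims are everything that is neither a batch nor a contraction dim.
    free_lhs = [d for d in range(lhs.ndim)
                if d not in batch_dims_lhs and d not in contract_dims_lhs]
    free_rhs = [d for d in range(rhs.ndim)
                if d not in batch_dims_rhs and d not in contract_dims_rhs]

    # Reorder both operands to (batch..., free..., contract...).
    l = np.transpose(lhs, batch_dims_lhs + free_lhs + contract_dims_lhs)
    r = np.transpose(rhs, batch_dims_rhs + free_rhs + contract_dims_rhs)

    batch_shape = l.shape[:len(batch_dims_lhs)]
    lf = l.shape[len(batch_dims_lhs):len(batch_dims_lhs) + len(free_lhs)]
    rf = r.shape[len(batch_dims_rhs):len(batch_dims_rhs) + len(free_rhs)]
    c = int(np.prod(l.shape[len(batch_dims_lhs) + len(free_lhs):]))

    # Collapse free and contraction dims, contract with a batched matmul, expand back.
    l = l.reshape(batch_shape + (int(np.prod(lf)), c))
    r = r.reshape(batch_shape + (int(np.prod(rf)), c))
    out = np.matmul(l, np.swapaxes(r, -1, -2))
    return out.reshape(batch_shape + lf + rf)

# Ordinary matmul: contract lhs dim 1 with rhs dim 0, no batch dims.
a, b = np.ones((4, 8)), np.ones((8, 16))
assert dot_general(a, b, [], [1], [], [0]).shape == (4, 16)
```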

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `batch_dims_lhs` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| `contract_dims_lhs` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| `batch_dims_rhs` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| `contract_dims_rhs` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |

Operands:

| Operand | Description |
| ------- | ----------- |
| `lhs` | ranked tensor of any type values |
| `rhs` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.embedding_backward (tt::ttir::EmbeddingBackwardOp)

Embedding backward operation.

The embedding_backward operation computes the gradient of the embedding operation with respect to the weight tensor.

This operation takes an input tensor of indices, the original weight tensor, and the gradient tensor from the forward pass. It computes how the embedding weights should be updated during backpropagation by accumulating gradients at the appropriate indices in the weight tensor.

Example:

// Embedding backward
%input = ... : tensor<2x3xi32>  // Original indices used in the forward pass
%weight = ... : tensor<10x4xf32>  // Original embedding table
%in_gradient = ... : tensor<2x3x4xf32>  // Gradient from the forward pass
%output = ttir.empty() : tensor<10x4xf32>  // Gradient for the embedding table
%result = ttir.embedding_backward(%input, %weight, %in_gradient, %output) :
    tensor<2x3xi32>, tensor<10x4xf32>, tensor<2x3x4xf32>, tensor<10x4xf32> -> tensor<10x4xf32>

// Input tensor (indices):
// [[0, 2, 5],
//  [7, 1, 9]]

// Input gradient tensor (from forward pass):
// [[[0.1, 0.2, 0.3, 0.4],  // gradient for embedding of index 0
//   [0.5, 0.6, 0.7, 0.8],  // gradient for embedding of index 2
//   [...]],                 // gradient for embedding of index 5
//  [[...],                  // gradient for embedding of index 7
//   [0.9, 1.0, 1.1, 1.2],  // gradient for embedding of index 1
//   [...]]]                 // gradient for embedding of index 9

// Output tensor (gradient for the embedding table):
// The gradients are accumulated at the corresponding indices in the weight tensor.
// For example, at index 0, the gradient is [0.1, 0.2, 0.3, 0.4]

Note: If the same index appears multiple times in the input tensor, the gradients are accumulated (added) at that index in the output tensor.

Inputs:

  • input (Tensor): The original input tensor containing indices used in the forward pass.
  • weight (Tensor): The original embedding table tensor.
  • in_gradient (Tensor): The gradient tensor from the forward pass.

Outputs:

  • result (Tensor): The gradient tensor for the embedding table.
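
A minimal NumPy sketch of the accumulation rule described above (gradients for repeated indices are summed); `embedding_backward` is a hypothetical helper assuming a 2D weight tensor:

```python
import numpy as np

def embedding_backward(indices, weight, in_gradient):
    """Sketch of the gradient accumulation: gradients for repeated indices are summed."""
    grad = np.zeros_like(weight)
    flat_idx = indices.reshape(-1)                        # (N,)
    flat_grad = in_gradient.reshape(-1, weight.shape[1])  # (N, embedding_dim)
    np.add.at(grad, flat_idx, flat_grad)                  # accumulates on duplicate indices
    return grad

indices = np.array([[0, 2, 5], [7, 1, 9]])
weight = np.zeros((10, 4), dtype=np.float32)
in_gradient = np.ones((2, 3, 4), dtype=np.float32)
print(embedding_backward(indices, weight, in_gradient)[0])  # [1. 1. 1. 1.]
```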

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `weight` | ranked tensor of any type values |
| `in_gradient` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.embedding (tt::ttir::EmbeddingOp)

Embedding lookup operation.

The embedding operation performs a lookup in an embedding table (weight matrix) using integer indices.

This operation takes an input tensor of indices and a weight tensor representing the embedding table. For each index in the input tensor, it retrieves the corresponding row from the weight tensor. The result is a tensor where each input index is replaced by its corresponding embedding vector.

Example:

// Embedding lookup
%input = ... : tensor<2x3xi32>  // Batch of indices
%weight = ... : tensor<10x4xf32>  // Embedding table with 10 entries of dimension 4
%output = ttir.empty() : tensor<2x3x4xf32>
%result = ttir.embedding(%input, %weight, %output) : tensor<2x3xi32>, tensor<10x4xf32>, tensor<2x3x4xf32> -> tensor<2x3x4xf32>

// Input tensor (indices):
// [[0, 2, 5],
//  [7, 1, 9]]

// Weight tensor (embedding table):
// [[0.1, 0.2, 0.3, 0.4],  // embedding vector for index 0
//  [0.5, 0.6, 0.7, 0.8],  // embedding vector for index 1
//  [0.9, 1.0, 1.1, 1.2],  // embedding vector for index 2
//  ...
//  [1.7, 1.8, 1.9, 2.0]]  // embedding vector for index 9

// Output tensor:
// [[[0.1, 0.2, 0.3, 0.4],  // embedding for index 0
//   [0.9, 1.0, 1.1, 1.2],  // embedding for index 2
//   [...]],                 // embedding for index 5
//  [[...],                  // embedding for index 7
//   [0.5, 0.6, 0.7, 0.8],  // embedding for index 1
//   [...]]]                 // embedding for index 9

Note: The indices in the input tensor must be valid indices into the first dimension of the weight tensor.

Inputs:

  • input (Tensor): The input tensor containing indices.
  • weight (Tensor): The embedding table tensor.

Outputs:

  • result (Tensor): The resulting tensor containing the embeddings.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `weight` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.empty (tt::ttir::EmptyOp)

Empty tensor allocation operation.

Syntax:

operation ::= `ttir.empty` `(` `)` attr-dict `:` type($result)

The empty operation creates an uninitialized tensor with the specified shape and element type.

This operation allocates memory for a tensor but does not initialize its values. It's commonly used as a first step before filling the tensor with computed values. The shape and element type of the tensor are determined by the return type.

Example:

// Create an uninitialized 2D tensor with shape [3, 4]
%result = ttir.empty() : tensor<3x4xf32>

// Create an uninitialized 3D tensor with shape [2, 3, 4]
%result = ttir.empty() : tensor<2x3x4xi32>

// Use empty to create a tensor for storing computation results
%input = ... : tensor<10x20xf32>
%output = ttir.empty() : tensor<10x20xf32>
%result = ttir.some_computation(%input, %output) : tensor<10x20xf32>, tensor<10x20xf32> -> tensor<10x20xf32>

Outputs:

  • result (Tensor): The uninitialized tensor.

Note: Since the tensor is uninitialized, reading from it before writing may yield undefined values. This operation is typically used in conjunction with other operations that will fill the tensor with meaningful values. The empty operation is more efficient than zeros or ones when the tensor will be completely overwritten, as it avoids the initialization step.

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: BufferizableOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.eq (tt::ttir::EqualOp)

Elementwise equality comparison operation.

The eq operation performs an elementwise equality comparison between two tensors.

For each pair of corresponding elements, it returns:

  • 1 (true) if the elements are equal
  • 0 (false) if the elements are not equal

Note that special handling may be required for floating-point NaN values, as NaN is not equal to any value, including itself.

Example:

// Compare elements for equality
%result = ttir.eq(%lhs, %rhs, %output) : tensor<4x4xf32>, tensor<4x4xf32>, tensor<4x4xi1> -> tensor<4x4xi1>
// Input tensors:
// %lhs: [[1.0, 2.0, 3.0, 2.0], ... ]
// %rhs: [[1.0, 2.0, 4.0, 5.0], ... ]
// Output tensor:
// [[1, 1, 0, 0], ... ]  // 1 where equal, 0 where not equal

// Example with integer tensors
%result = ttir.eq(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi1> -> tensor<3xi1>
// Input tensors:
// %int_lhs: [10, -5, 0]
// %int_rhs: [10, 5, 1]
// Output tensor:
// [1, 0, 0]  // Only the first elements are equal

Mathematical definition: equal(x, y) = x == y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `lhs` | ranked tensor of any type values |
| `rhs` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.erf (tt::ttir::ErfOp)

Element-wise error function operation.

Element-wise error function (erf) operation. Calculates erf(x) for each element of the input tensor.

Example:

// Compute error function for all elements in %input
%result = ttir.erf(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor with values [0.0, 1.0, -1.0, 2.0]
// Output tensor with values [0.0, 0.8427, -0.8427, 0.9953]

Mathematical definition: erf(x) = (2/√π) ∫₀ˣ e^(-t²) dt

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.erfc (tt::ttir::ErfcOp)

Element-wise complementary error function operation.

Element-wise complementary error function (erfc) operation. Calculates erfc(x) = 1 - erf(x) for each element of the input tensor.

Example:

// Compute complementary error function for all elements in %input
%result = ttir.erfc(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor with values [0.0, 1.0, -1.0, 2.0]
// Output tensor with values [1.0, 0.1573, 1.8427, 0.0047]

Mathematical definition: erfc(x) = 1 - erf(x) = (2/√π) ∫ₓ^∞ e^(-t²) dt

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.exp (tt::ttir::ExpOp)

Elementwise exponential op.

The exp operation computes the exponential of each element in the input tensor.

For each element, it returns e^x, where e is the base of natural logarithms (approximately 2.71828).

Example:

// Compute exponential of all elements in %input
%result = ttir.exp(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.0, 2.0, -3.0, 4.0], ... ]
// Output tensor:
// [[2.71828, 7.38906, 0.04979, 54.59815], ... ]

Mathematical definition: exp(x) = e^x

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.expm1 (tt::ttir::Expm1Op)

Elementwise exponential minus one operation.

The expm1 operation computes the exponential of each element in the input tensor and subtracts one.

For each element x, it returns e^x - 1. This operation is more accurate than computing exp(x) - 1 directly for x values close to zero, where catastrophic cancellation can occur in the subtraction.

Example:

// Compute expm1 of all elements in %input
%result = ttir.expm1(%input, %output) : tensor<2x2xf32>, tensor<2x2xf32> -> tensor<2x2xf32>
// Input tensor:
// [[0.0, 1.0],
//  [0.0, 0.0]]
// Output tensor:
// [[0.0, 1.71828],
//  [0.0, 0.0]]

// Example with small values where expm1 is more accurate than exp(x)-1
%result = ttir.expm1(%small_input, %small_output) : tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensor:
// [1e-10, 1e-7, 1e-5]
// Output tensor:
// [1e-10, 1e-7, 1e-5]  // Approximately equal to the input for very small values

Mathematical definition: expm1(x) = e^x - 1

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `output` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttir.fill_cache (tt::ttir::FillCacheOp)

Cache filling operation.

The fill_cache operation fills a cache tensor with values from an input tensor.

Unlike update_cache which updates specific positions, this operation fills the entire cache or a contiguous section of it with values from the input tensor. This is commonly used to initialize a cache in sequence models.

Example:

// Fill cache with input values
%cache = ... : tensor<2x16x64xf32>  // Batch size 2, sequence length 16, hidden dim 64
%input = ... : tensor<2x16x64xf32>  // Initial values for the entire cache
%result = ttir.fill_cache(%cache, %input) {batch_offset = 0 : i32} :
    tensor<2x16x64xf32>, tensor<2x16x64xf32> -> tensor<2x16x64xf32>
// The entire cache tensor is filled with values from input

// Fill a portion of the cache
%cache = ... : tensor<2x16x64xf32>  // Batch size 2, sequence length 16, hidden dim 64
%input = ... : tensor<2x8x64xf32>   // Values for half of the cache
%result = ttir.fill_cache(%cache, %input) {batch_offset = 0 : i32} :
    tensor<2x16x64xf32>, tensor<2x8x64xf32> -> tensor<2x16x64xf32>
// The first 8 positions of the cache are filled with values from input

Inputs:

  • cache (Tensor): The cache tensor to be filled.
  • input (Tensor): The input tensor containing the values to fill the cache with.

Attributes:

  • batch_offset (Integer): Offset in the batch dimension.

Outputs:

  • result (Tensor): The filled cache tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • batch_offset (::mlir::IntegerAttr): 32-bit signless integer attribute

Operands:

  • cache: ranked tensor of any type values
  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.floor (tt::ttir::FloorOp)

Elementwise floor operation.

The floor operation computes the floor (greatest integer less than or equal to x) of each element in the input tensor.

For each element, it rounds the value down to the nearest integer. The operation preserves the data type of the input.

This operation has the idempotence property, meaning that applying it multiple times produces the same result as applying it once: floor(floor(x)) = floor(x).

Example:

// Compute floor of all elements in %input
%result = ttir.floor(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[1.0, 2.0, -1.0, 4.0], ... ]

Mathematical definition: floor(x) = ⌊x⌋ = max{n ∈ ℤ | n ≤ x}

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TTIR_Idempotence, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.full (tt::ttir::FullOp)

Creates a tensor filled with the specified value

Tensor operation to create a tensor filled with a specified value.

Given a shape and a fill_value, produces a tensor with the shape, filled with the specified value.

Example:

%0 = "ttir.full"() <{shape = array<i32: 64, 32, 32>, fill_value = 7 : i32}> : () -> tensor<64x32x32xi32>
// %0: [[[7, 7, 7, ..., 7], [7, 7, 7, ..., 7], ..., [7, 7, 7, ..., 7]]]

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: BufferizableOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

  • shape (::mlir::DenseI32ArrayAttr): i32 dense array attribute
  • fill_value (::mlir::Attribute): 32-bit float attribute or 32-bit signless integer attribute

Results:

  • result: ranked tensor of any type values

ttir.gather (tt::ttir::GatherOp)

Gather operation.

The gather operation collects slices from an input tensor at positions specified by start indices.

This operation is based on the StableHLO Gather operation (https://openxla.org/stablehlo/spec#gather) and allows for flexible slicing and indexing of tensors. It can be used to implement operations like array indexing, slicing, dynamic indexing, and more complex gathering patterns.

Example:

// Basic gather example: gather elements from a 2D tensor using indices
%input = ... : tensor<5x3xf32>         // Input tensor with shape [5,3]
%indices = ... : tensor<2xi64>         // Indices tensor with values [2, 1]
%output = ttir.empty() : tensor<3xf32> // Output tensor
%result = ttir.gather(%input, %indices, %output) {
    offset_dims = [0],                 // Output dimensions that are gathered from input
    collapsed_slice_dims = [0],        // Input dimensions that are collapsed
    operand_batching_dims = [],        // Batch dimensions of the input
    start_indices_batching_dims = [],  // Batch dimensions of the indices
    start_index_map = [0],             // Maps indices to input dimensions
    index_vector_dim = 0,              // Which dimension of indices contains the index vector
    slice_sizes = [1, 3],              // Size of the slice to extract from each position
    indices_are_sorted = false         // Whether indices are sorted
} : tensor<5x3xf32>, tensor<2xi64>, tensor<3xf32> -> tensor<3xf32>

// This gathers a slice of size [1,3] starting at position [2,0] from the input tensor,
// which results in the values from the third row of the input tensor.

Inputs:

  • input (Tensor): The tensor from which to gather values.
  • start_indices (Tensor): Tensor containing the starting indices for slices.

Attributes:

  • offset_dims (Array of Integer): Output dimensions that correspond to dimensions of the gathered slice.
  • collapsed_slice_dims (Array of Integer): Input dimensions that are collapsed when gathering.
  • operand_batching_dims (Array of Integer): Batch dimensions of the input tensor.
  • start_indices_batching_dims (Array of Integer): Batch dimensions of the indices tensor.
  • start_index_map (Array of Integer): Maps index values to input dimensions.
  • index_vector_dim (Integer): Which dimension of indices contains the index vector.
  • slice_sizes (Array of Integer): Size of the slice to extract from each position.
  • indices_are_sorted (Boolean): Whether indices are sorted (for optimization).

Outputs:

  • result (Tensor): The gathered tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • offset_dims (::mlir::DenseI64ArrayAttr): i64 dense array attribute
  • collapsed_slice_dims (::mlir::DenseI64ArrayAttr): i64 dense array attribute
  • operand_batching_dims (::mlir::DenseI64ArrayAttr): i64 dense array attribute
  • start_indices_batching_dims (::mlir::DenseI64ArrayAttr): i64 dense array attribute
  • start_index_map (::mlir::DenseI64ArrayAttr): i64 dense array attribute
  • index_vector_dim (::mlir::IntegerAttr): 64-bit signed integer attribute
  • slice_sizes (::mlir::DenseI64ArrayAttr): i64 dense array attribute
  • indices_are_sorted (::mlir::BoolAttr): bool attribute

Operands:

  • input: ranked tensor of any type values
  • start_indices: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.gelu (tt::ttir::GeluOp)

Elementwise GELU operation.

The gelu operation computes the GELU (Gaussian Error Linear Unit) of each element in the input tensor.

For each element, it returns the GELU value, which is a smooth, non-monotonic function that approximates the cumulative distribution function of a standard normal distribution. The operation preserves the data type of the input.

Example:

// Compute GELU of all elements in %input
%result = ttir.gelu(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[1.6243, 1.9545, -0.1146, 4.5], ... ]

Mathematical definition: gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2)))

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.generic (tt::ttir::GenericOp)

Generically dispatch work to a grid of cores.

Syntax:

operation ::= `ttir.generic` attr-dict `\n`
              ` ` ` ` ` ` ` ` `ins` `(` $inputs `:` type($inputs) `)` `\n`
              ` ` ` ` ` ` ` ` `outs` `(` $outputs  `:` type($outputs) `)` ` `  $regions (`:`  type($results)^ )?

This generic op carries a region that represents the work each core does. The region is expected to have the same signature as the op itself with respect to input and output operands. The op is expected to be lowered to a backend specific form by a consuming backend. This op is heavily inspired by the linalg.generic op so it can be useful to refer to linalg.generic documentation for more details.

%5 = "ttir.generic"(%1, %2, %3, %4) <{
  grid = #tt.grid<1x1>,                        // The grid range of cores to dispatch work to.
  indexing_maps = [#map, #map, #map],          // Affine maps for indexing into the input/output tensors. See linalg.generic
  iterator_types = [#parallel, #parallel],     // Iterator types for the input/output tensors. See linalg.generic
  threads = [#ttir.thread<compute>],           // Thread types for the regions.
  operandSegmentSizes = array<i32: 2, 1>       // Sizes of the operand segments, i.e. 2 inputs and 1 output.
}> ({
^bb0(%arg2: memref<64x128xf32, #l1_>,
     %arg3: memref<64x128xf32, #l1_>,
     %arg4: memref<64x128xf32, #l1_>):
    // Region body, would contain some computation that represents the work each core does.
}) : (tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>

Traits: AttrSizedOperandSegments, NoTerminator

Interfaces: BufferizableOpInterface, DestinationStyleOpInterface, MemoryEffectOpInterface, OpAsmOpInterface, TTIROpInterface

Attributes:

  • grid (::mlir::tt::GridAttr): TT grid attribute
  • indexing_maps (::mlir::ArrayAttr): AffineMap array attribute
  • iterator_types (::mlir::ArrayAttr)
  • threads (::mlir::ArrayAttr)

Operands:

  • inputs: variadic of ranked tensor of any type values or non-0-ranked memref of any type values
  • outputs: variadic of ranked tensor of any type values or non-0-ranked memref of any type values

Results:

  • results: variadic of ranked tensor of any type values

ttir.get_dimension_size (tt::ttir::GetDimensionSizeOp)

GetDimensionSize op.

Produces the size of the given dimension of the operand.

Example:

%operand: [[3, 2, 7], [1, 4, 4]]
"ttir.get_dimension_size"(%operand, value = dense<0>, %out) -> %out: [[3]]

Attributes:

  • dimension (::mlir::IntegerAttr): 32-bit signless integer attribute

Operands:

  • operand: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.ge (tt::ttir::GreaterEqualOp)

Elementwise greater than or equal to.

The ge operation performs an elementwise greater than or equal to comparison between two tensors.

For each pair of corresponding elements, it returns:

  • 1 (true) if the left element is greater than or equal to the right element
  • 0 (false) if the left element is less than the right element

Example:

// Compare elements for greater than or equal to
%result = ttir.ge(%lhs, %rhs, %output) : tensor<4x4xf32>, tensor<4x4xf32>, tensor<4x4xi1> -> tensor<4x4xi1>
// Input tensors:
// %lhs: [[1.0, 2.0, 3.0, 2.0], ... ]
// %rhs: [[1.0, 2.0, 4.0, 5.0], ... ]
// Output tensor:
// [[1, 1, 0, 0], ... ]  // 1 where greater or equal, 0 where less

// Example with integer tensors
%result = ttir.ge(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi1> -> tensor<3xi1>
// Input tensors:
// %int_lhs: [10, -5, 0]
// %int_rhs: [10, 5, 1]
// Output tensor:
// [1, 0, 0]  // Only the first element is greater or equal

Mathematical definition: greater_equal(x, y) = x >= y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.gt (tt::ttir::GreaterThanOp)

Elementwise greater than.

The gt operation performs an elementwise greater than comparison between two tensors.

For each pair of corresponding elements, it returns:

  • 1 (true) if the left element is greater than the right element
  • 0 (false) if the left element is less than or equal to the right element

Example:

// Compare elements for greater than
%result = ttir.gt(%lhs, %rhs, %output) : tensor<4x4xf32>, tensor<4x4xf32>, tensor<4x4xi1> -> tensor<4x4xi1>
// Input tensors:
// %lhs: [[1.0, 2.0, 3.0, 2.0], ... ]
// %rhs: [[1.0, 2.0, 4.0, 5.0], ... ]
// Output tensor:
// [[0, 0, 0, 0], ... ]  // 1 where greater, 0 where less or equal (no element here is strictly greater)

// Example with integer tensors
%result = ttir.gt(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi1> -> tensor<3xi1>
// Input tensors:
// %int_lhs: [10, -5, 0]
// %int_rhs: [10, 5, 1]
// Output tensor:
// [0, 0, 0]  // No element of %int_lhs is greater than its counterpart

Mathematical definition: greater_than(x, y) = x > y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.index (tt::ttir::IndexOp)

Tensor indexing operation.

The index operation extracts a sub-tensor (slice) from the input tensor along a specified dimension.

This operation selects elements from the input tensor along a single dimension based on the specified begin, end, and step indices. It's similar to Python's slicing notation tensor[:, begin:end:step, :] where the slicing is applied only to the specified dimension.

Example:

// Extract elements with indices 1, 3, 5 from dimension 0 of a 1D tensor
%input = ... : tensor<6xf32>  // Input tensor with values: [1, 2, 3, 4, 5, 6]
%output = ttir.empty() : tensor<3xf32>  // Output tensor shape
%result = ttir.index(%input, %output) {
    dim = 0 : i32,    // Dimension to index
    begin = 1 : i32,  // Start index
    end = 6 : i32,    // End index (exclusive)
    step = 2 : i32    // Step size
} : tensor<6xf32>, tensor<3xf32> -> tensor<3xf32>
// Result: [2, 4, 6]

// Extract columns 0 and 2 from a 2D tensor
%input = ... : tensor<3x4xf32>  // Input tensor with values:
                                // [[1, 2, 3, 4],
                                //  [5, 6, 7, 8],
                                //  [9, 10, 11, 12]]
%output = ttir.empty() : tensor<3x2xf32>  // Output tensor shape
%result = ttir.index(%input, %output) {
    dim = 1 : i32,    // Index along columns (dimension 1)
    begin = 0 : i32,  // Start from first column
    end = 3 : i32,    // End at third column (exclusive)
    step = 2 : i32    // Take every other column
} : tensor<3x4xf32>, tensor<3x2xf32> -> tensor<3x2xf32>
// Result:
// [[1, 3],
//  [5, 7],
//  [9, 11]]

Inputs:

  • input (Tensor): The input tensor to index.

Attributes:

  • dim (Integer): The dimension along which to index.
  • begin (Integer): The starting index.
  • end (Integer): The ending index (exclusive).
  • step (Integer): The step size between indices.

Outputs:

  • result (Tensor): The indexed tensor.

Note: The shape of the output tensor is the same as the input tensor except for the indexed dimension, which will have size ceil((end - begin) / step). The indices selected will be begin, begin + step, begin + 2*step, etc., up to but not including end.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • dim (::mlir::IntegerAttr): 32-bit signless integer attribute
  • begin (::mlir::IntegerAttr): 32-bit signless integer attribute
  • end (::mlir::IntegerAttr): 32-bit signless integer attribute
  • step (::mlir::IntegerAttr): 32-bit signless integer attribute

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.isfinite (tt::ttir::IsFiniteOp)

Elementwise isfinite operation.

The isfinite operation checks if each element in the input tensor is finite (neither infinite nor NaN).

For each element, it returns a boolean value indicating whether the element is finite.

Example:

// Check if all elements in %input are finite
%result = ttir.isfinite(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, Inf, 4.5], ... ]
// Output tensor:
// [[true, true, false, true], ... ]

Mathematical definition: isfinite(x) = x ∈ ℝ

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.kernel (tt::ttir::KernelOp)

Kernel call.

A generic kernel call operation. This operation is used to pattern match by some consuming backend.
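A minimal, purely illustrative sketch of how such a call might look in generic form (the symbol names @eltwise_add and @compute, the tensor shapes, and the explicit operandSegmentSizes are assumptions made for illustration, not taken from the dialect definition):

%result = "ttir.kernel"(%lhs, %rhs, %output) <{
    op = @eltwise_add,                       // Symbol of the kernel being called (assumed name)
    kind = @compute,                         // Kernel kind matched by the consuming backend (assumed name)
    operandSegmentSizes = array<i32: 2, 1>   // 2 inputs, 1 output
}> : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>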

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • op (::mlir::FlatSymbolRefAttr): flat symbol reference attribute
  • kind (::mlir::FlatSymbolRefAttr): flat symbol reference attribute

Operands:

  • inputs: variadic of ranked tensor of any type values or non-0-ranked memref of any type values
  • outputs: variadic of ranked tensor of any type values or non-0-ranked memref of any type values

Results:

  • results: variadic of ranked tensor of any type values or non-0-ranked memref of any type values

ttir.leaky_relu (tt::ttir::LeakyReluOp)

Eltwise leaky relu operation.

The Leaky ReLU (Rectified Linear Unit) operation computes an element-wise activation function over its input tensor. It is defined as:

y = x               if x > 0
y = parameter * x   if x <= 0

where parameter is a small, user-defined constant that determines the slope for negative inputs.
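Example (a minimal sketch in the same style as the other elementwise examples in this reference; the slope value 0.01 is an assumed parameter, not a default):

// Apply Leaky ReLU with an assumed negative slope of 0.01
%result = ttir.leaky_relu(%input, %output) {parameter = 0.01 : f32} : tensor<2x2xf32>, tensor<2x2xf32> -> tensor<2x2xf32>
// Input tensor:
// [[1.0, -2.0],
//  [0.5, -0.1]]
// Output tensor:
// [[1.0, -0.02],
//  [0.5, -0.001]]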

Inputs:

  • input (Tensor): The input tensor to be activated.

Outputs:

  • output (Tensor): The tensor after applying the Leaky ReLU activation.

Attributes:

  • parameter (float): The slope for negative values.

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Attributes:

  • parameter (::mlir::FloatAttr): 32-bit float attribute

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.le (tt::ttir::LessEqualOp)

Elementwise less than or equal to.

The le operation performs an elementwise less than or equal to comparison between two tensors.

For each pair of corresponding elements, it returns:

  • 1 (true) if the left element is less than or equal to the right element
  • 0 (false) if the left element is greater than the right element

Example:

// Compare elements for less than or equal to
%result = ttir.le(%lhs, %rhs, %output) : tensor<4x4xf32>, tensor<4x4xf32>, tensor<4x4xi1> -> tensor<4x4xi1>
// Input tensors:
// %lhs: [[1.0, 2.0, 3.0, 2.0], ... ]
// %rhs: [[1.0, 2.0, 4.0, 5.0], ... ]
// Output tensor:
// [[1, 1, 1, 1], ... ]  // 1 where less or equal, 0 where greater (all elements here are less or equal)

// Example with integer tensors
%result = ttir.le(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi1> -> tensor<3xi1>
// Input tensors:
// %int_lhs: [10, -5, 0]
// %int_rhs: [10, 5, 1]
// Output tensor:
// [1, 1, 1]  // All elements are less or equal

Mathematical definition: less_equal(x, y) = x <= y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.lt (tt::ttir::LessThanOp)

Elementwise less than.

The lt operation performs an elementwise less than comparison between two tensors.

For each pair of corresponding elements, it returns:

  • 1 (true) if the left element is less than the right element
  • 0 (false) if the left element is greater than or equal to the right element

Example:

// Compare elements for less than
%result = ttir.lt(%lhs, %rhs, %output) : tensor<4x4xf32>, tensor<4x4xf32>, tensor<4x4xi1> -> tensor<4x4xi1>
// Input tensors:
// %lhs: [[1.0, 2.0, 3.0, 2.0], ... ]
// %rhs: [[1.0, 2.0, 4.0, 5.0], ... ]
// Output tensor:
// [[0, 0, 1, 1], ... ]  // 1 where less, 0 where greater or equal

// Example with integer tensors
%result = ttir.lt(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi1> -> tensor<3xi1>
// Input tensors:
// %int_lhs: [10, -5, 0]
// %int_rhs: [10, 5, 1]
// Output tensor:
// [0, 1, 1]  // The last two elements are less

Mathematical definition: less_than(x, y) = x < y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.linear (tt::ttir::LinearOp)

Linear transformation operation.

The linear operation performs a linear transformation by computing the matrix multiplication of tensors a and b with an optional addition of a bias tensor.

This operation is commonly used in neural networks to implement fully connected layers. It computes the matrix multiplication of the input tensor with a weight tensor and adds an optional bias.

Example:

// Linear transformation with bias
%a = ... : tensor<10x64x32xbf16>  // Input tensor: batch_size=10, sequence_length=64, input_dim=32
%b = ... : tensor<32x128xbf16>    // Weight tensor: input_dim=32, output_dim=128
%bias = ... : tensor<128xbf16>    // Bias tensor: output_dim=128
%output = ttir.empty() : tensor<10x64x128xbf16>  // Output tensor shape
%result = ttir.linear(%a, %b, %bias, %output) :
    tensor<10x64x32xbf16>, tensor<32x128xbf16>, tensor<128xbf16>, tensor<10x64x128xbf16> -> tensor<10x64x128xbf16>

// Linear transformation without bias
%a = ... : tensor<10x64x32xf32>  // Input tensor
%b = ... : tensor<32x128xf32>    // Weight tensor
%output = ttir.empty() : tensor<10x64x128xf32>  // Output tensor shape
%result = ttir.linear(%a, %b, %output) :
    tensor<10x64x32xf32>, tensor<32x128xf32>, tensor<10x64x128xf32> -> tensor<10x64x128xf32>

Inputs:

  • a (Tensor): The input tensor.
  • b (Tensor): The weight tensor.
  • bias (Optional Tensor): The bias tensor to add to the result of the matrix multiplication.

Attributes:

  • transpose_a (Boolean, default=false): Whether to transpose tensor a before multiplication.
  • transpose_b (Boolean, default=false): Whether to transpose tensor b before multiplication.

Outputs:

  • result (Tensor): The result of the linear transformation.

The operation computes: result = matmul(a, b) + bias

Note: The shapes of the tensors must be compatible for matrix multiplication. For a 3D input tensor with shape [batch_size, sequence_length, input_dim], the weight tensor should have shape [input_dim, output_dim], and the bias tensor should have shape [output_dim]. The resulting tensor will have shape [batch_size, sequence_length, output_dim].

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • transpose_a (::mlir::BoolAttr): bool attribute
  • transpose_b (::mlir::BoolAttr): bool attribute

Operands:

  • a: ranked tensor of any type values
  • b: ranked tensor of any type values
  • bias: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.log1p (tt::ttir::Log1pOp)

Elementwise natural logarithm of one plus input operation.

The log1p operation computes the natural logarithm of one plus each element in the input tensor.

For each element x, it returns ln(1 + x). This operation is more accurate than computing log(1 + x) directly for x values close to zero, and it is defined for x > -1. For values less than or equal to -1, the behavior depends on the implementation (may return NaN or negative infinity).

Example:

// Compute log1p of all elements in %input
%result = ttir.log1p(%input, %output) : tensor<5xf32>, tensor<5xf32> -> tensor<5xf32>
// Input tensor:
// [0.0, -0.999, 7.0, 6.38905621, 15.0]
// Output tensor:
// [0.0, -6.90776825, 2.07944155, 2.0, 2.77258873]

// Example with small values where log1p is more accurate than log(1+x)
%result = ttir.log1p(%small_input, %small_output) : tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensor:
// [1e-10, 1e-7, 1e-5]
// Output tensor:
// [1e-10, 1e-7, 1e-5]  // Approximately equal to the input for very small values

Mathematical definition: log1p(x) = ln(1 + x)

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.log (tt::ttir::LogOp)

Elementwise natural logarithm operation.

The log operation computes the natural logarithm of each element in the input tensor.

For each element, it returns the natural logarithm (base e) of the value. This operation is defined only for positive values; the behavior for zero or negative inputs depends on the implementation (may return NaN, infinity, or other special values).

Example:

// Compute natural logarithm of all elements in %input
%result = ttir.log(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.0, 2.718, 7.389, 20.086], ... ]
// Output tensor:
// [[0.0, 1.0, 2.0, 3.0], ... ]

// Example with different values
%result = ttir.log(%float_input, %float_output) : tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensor:
// [10.0, 100.0, 1000.0]
// Output tensor:
// [2.303, 4.605, 6.908]  // ln(10), ln(100), ln(1000)

Mathematical definition: log(x) = ln(x), where ln is the natural logarithm

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.logical_and (tt::ttir::LogicalAndOp)

Elementwise logical and.

The logical_and operation performs an elementwise logical AND operation between two tensors.

For each pair of corresponding elements, it returns:

  • 1 (true) if both elements are 1 (true)
  • 0 (false) if at least one element is 0 (false)

Example:

// Logical AND operation
%result = ttir.logical_and(%lhs, %rhs, %output) : tensor<4x4xi1>, tensor<4x4xi1>, tensor<4x4xi1> -> tensor<4x4xi1>
// Input tensors:
// %lhs: [[1, 0, 1, 0], ... ]
// %rhs: [[1, 1, 0, 1], ... ]
// Output tensor:
// [[1, 0, 0, 0], ... ]  // 1 where both are 1, 0 otherwise

// Example with integer tensors
%result = ttir.logical_and(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi1> -> tensor<3xi1>
// Input tensors:
// %int_lhs: [10, 0, 0]
// %int_rhs: [10, 5, 1]
// Output tensor:
// [1, 0, 0]  // Only the first element is true

Mathematical definition: logical_and(x, y) = x && y

Traits: AlwaysSpeculatableImplTrait, TTIR_BinaryIdempotence, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.logical_not (tt::ttir::LogicalNotOp)

Elementwise logical not operation.

The logical_not operation computes the logical negation of each element in the input tensor.

For each element, it returns 1 (true) if the element is zero (false) and 0 (false) if the element is non-zero (true).

Example:

// Compute logical negation of all elements in %input
%result = ttir.logical_not(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.0, 4.5], ... ]
// Output tensor:
// [[false, false, true, false], ... ]

Mathematical definition: logical_not(x) = !x

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.logical_or (tt::ttir::LogicalOrOp)

Elementwise logical or.

The logical_or operation performs an elementwise logical OR operation between two tensors.

For each pair of corresponding elements, it returns:

  • 1 (true) if at least one element is 1 (true)
  • 0 (false) if both elements are 0 (false)

Example:

// Logical OR operation
%result = ttir.logical_or(%lhs, %rhs, %output) : tensor<4x4xi1>, tensor<4x4xi1>, tensor<4x4xi1> -> tensor<4x4xi1>
// Input tensors:
// %lhs: [[1, 0, 1, 0], ... ]
// %rhs: [[1, 1, 0, 1], ... ]
// Output tensor:
// [[1, 1, 1, 1], ... ]  // 1 where at least one is 1, 0 otherwise

// Example with integer tensors
%result = ttir.logical_or(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi1> -> tensor<3xi1>
// Input tensors:
// %int_lhs: [10, 0, 0]
// %int_rhs: [10, 5, 1]
// Output tensor:
// [1, 1, 1]  // All elements are true

Mathematical definition: logical_or(x, y) = x || y

Traits: AlwaysSpeculatableImplTrait, TTIR_BinaryIdempotence, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.logical_xor (tt::ttir::LogicalXorOp)

Elementwise logical xor.

The logical_xor operation performs an elementwise logical XOR operation between two tensors.

For each pair of corresponding elements, it returns:

  • 1 (true) if exactly one element is 1 (true)
  • 0 (false) if both elements are 0 (false) or both are 1 (true)

Example:

// Logical XOR operation
%result = ttir.logical_xor(%lhs, %rhs, %output) : tensor<4x4xi1>, tensor<4x4xi1>, tensor<4x4xi1> -> tensor<4x4xi1>
// Input tensors:
// %lhs: [[1, 0, 1, 0], ... ]
// %rhs: [[1, 1, 0, 1], ... ]
// Output tensor:
// [[0, 1, 1, 1], ... ]  // 1 where exactly one is 1, 0 otherwise

// Example with integer tensors
%result = ttir.logical_xor(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi1> -> tensor<3xi1>
// Input tensors:
// %int_lhs: [10, 0, 0]
// %int_rhs: [10, 5, 1]
// Output tensor:
// [0, 1, 1]  // The last two elements are true

Mathematical definition: logical_xor(x, y) = x ^^ y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.matmul (tt::ttir::MatmulOp)

Matrix multiplication operation.

The matmul operation computes the matrix multiplication of two tensors.

This operation performs matrix multiplication between tensors a and b. It supports optional transposition of either input tensor before multiplication. For 2D tensors, this computes the standard matrix product. For tensors with more dimensions, it applies batched matrix multiplication.

Example:

// Basic matrix multiplication of 2D tensors
%a = ... : tensor<3x4xf32>  // Matrix A with shape [3,4]
%b = ... : tensor<4x5xf32>  // Matrix B with shape [4,5]
%output = ttir.empty() : tensor<3x5xf32>  // Output matrix shape
%result = ttir.matmul(%a, %b, %output) :
    tensor<3x4xf32>, tensor<4x5xf32>, tensor<3x5xf32> -> tensor<3x5xf32>

// Batched matrix multiplication with transposition
%a = ... : tensor<2x3x4xf32>  // Batch of 2 matrices with shape [3,4]
%b = ... : tensor<2x5x4xf32>  // Batch of 2 matrices with shape [5,4]
%output = ttir.empty() : tensor<2x3x5xf32>  // Output shape
%result = ttir.matmul(%a, %b, %output) {
    transpose_a = false,  // Don't transpose A
    transpose_b = true    // Transpose B before multiplication
} : tensor<2x3x4xf32>, tensor<2x5x4xf32>, tensor<2x3x5xf32> -> tensor<2x3x5xf32>

Inputs:

  • a (Tensor): The first input tensor.
  • b (Tensor): The second input tensor.

Attributes:

  • transpose_a (Boolean, default=false): Whether to transpose tensor a before multiplication.
  • transpose_b (Boolean, default=false): Whether to transpose tensor b before multiplication.

Outputs:

  • result (Tensor): The result of the matrix multiplication.

Note: The inner dimensions of the input tensors must be compatible for matrix multiplication. If a has shape [..., m, k] and b has shape [..., k, n], then the result will have shape [..., m, n]. If transpose_a is true, then a is treated as having shape [..., k, m]. If transpose_b is true, then b is treated as having shape [..., n, k].

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • transpose_a (::mlir::BoolAttr): bool attribute
  • transpose_b (::mlir::BoolAttr): bool attribute

Operands:

  • a: ranked tensor of any type values
  • b: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.max (tt::ttir::MaxOp)

Maximum reduction operation.

The max operation computes the maximum value of elements along specified dimensions of the input tensor.

This operation reduces the input tensor by finding the maximum value of all elements along the dimensions specified in dim_arg. If dim_arg is not provided, the maximum is computed over all dimensions, resulting in a scalar value. If keep_dim is set to true, the reduced dimensions are retained with a size of 1.

Example:

// Maximum along dimension 1
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<2xf32>
%result = ttir.max(%input, %output) {keep_dim = false, dim_arg = [1: i32]} : tensor<2x3xf32>, tensor<2xf32> -> tensor<2xf32>
// Input tensor:
// [[1.0, 5.0, 3.0],
//  [4.0, 2.0, 6.0]]
// Output tensor:
// [5.0, 6.0]  // Maximum of each row

// Maximum along dimension 0
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<3xf32>
%result = ttir.max(%input, %output) {keep_dim = false, dim_arg = [0: i32]} : tensor<2x3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensor:
// [[1.0, 5.0, 3.0],
//  [4.0, 2.0, 6.0]]
// Output tensor:
// [4.0, 5.0, 6.0]  // Maximum of each column

// Maximum over all dimensions
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<f32>
%result = ttir.max(%input, %output) {keep_dim = false} : tensor<2x3xf32>, tensor<f32> -> tensor<f32>
// Input tensor:
// [[1.0, 5.0, 3.0],
//  [4.0, 2.0, 6.0]]
// Output tensor:
// 6.0  // Maximum of all elements

Note: When comparing with NaN values, NaN is typically not selected as the maximum value.

Mathematical definition: max(x, dim) = max(x[i]) for all i in dimension dim

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • keep_dim (Bool): Whether to keep the reduced dimensions or not.
  • dim_arg (Array of Int32): Dimensions to reduce along.

Outputs:

  • output (Tensor): The result tensor after applying the reduction.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • keep_dim (::mlir::BoolAttr): bool attribute
  • dim_arg (::mlir::ArrayAttr): 32-bit integer array attribute

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.max_pool2d (tt::ttir::MaxPool2dOp)

2D maximum pooling operation.

The max_pool2d operation applies a 2D maximum pooling over an input tensor composed of several input planes.

This operation performs downsampling by dividing the input into local regions and computing the maximum value of each region. It reduces the spatial dimensions (height and width) of an input tensor while preserving the batch and channel dimensions. This is commonly used in neural networks to reduce the spatial size of feature maps while retaining the most important features.

Example:

// Basic 2D max pooling with a 2x2 kernel and stride 1
%input = ... : tensor<1x3x3x1xf32>  // 3x3 input tensor with values:
                                    // [[[1, 2, 3],
                                    //   [4, 5, 6],
                                    //   [7, 8, 9]]]]
%output = ttir.empty() : tensor<1x2x2x1xf32>
%result = ttir.max_pool2d(%input, %output) {
    kernel_height = 2 : i32,
    kernel_width = 2 : i32,
    stride_height = 1 : i32,
    stride_width = 1 : i32,
    dilation_height = 1 : i32,
    dilation_width = 1 : i32,
    ceil_mode = false,
    padding_left = 0 : i32,
    padding_right = 0 : i32,
    padding_top = 0 : i32,
    padding_bottom = 0 : i32
} : tensor<1x3x3x1xf32>, tensor<1x2x2x1xf32> -> tensor<1x2x2x1xf32>
// Result: [[[5, 6],
//           [8, 9]]]]
// Where: 5 = max(1,2,4,5), 6 = max(2,3,5,6), 8 = max(4,5,7,8), 9 = max(5,6,8,9)
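Assuming the standard pooling output-size formula with floor rounding (ceil_mode = false), the 2x2 output shape above can be checked as follows:

// H_out = floor((H + padding_top + padding_bottom - dilation_height * (kernel_height - 1) - 1) / stride_height) + 1
//       = floor((3 + 0 + 0 - 1 * (2 - 1) - 1) / 1) + 1 = 2
// W_out is computed the same way and also equals 2.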

Inputs:

  • input (Tensor): Input tensor in NHWC format (batch, height, width, channels).

Attributes:

  • kernel_height (Integer): Height of the pooling kernel.
  • kernel_width (Integer): Width of the pooling kernel.
  • stride_height (Integer): Stride along the height dimension.
  • stride_width (Integer): Stride along the width dimension.
  • dilation_height (Integer): Dilation factor for height dimension.
  • dilation_width (Integer): Dilation factor for width dimension.
  • ceil_mode (Boolean): When true, uses ceil instead of floor for output shape calculation.
  • padding_left, padding_right, padding_top, padding_bottom (Integer): Padding on each side.

Outputs:

  • result (Tensor): Output tensor after maximum pooling.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • kernel_height (::mlir::IntegerAttr): 32-bit signed integer attribute
  • kernel_width (::mlir::IntegerAttr): 32-bit signed integer attribute
  • stride_height (::mlir::IntegerAttr): 32-bit signed integer attribute
  • stride_width (::mlir::IntegerAttr): 32-bit signed integer attribute
  • dilation_height (::mlir::IntegerAttr): 32-bit signed integer attribute
  • dilation_width (::mlir::IntegerAttr): 32-bit signed integer attribute
  • ceil_mode (::mlir::BoolAttr): bool attribute
  • padding_left (::mlir::IntegerAttr): 32-bit signed integer attribute
  • padding_right (::mlir::IntegerAttr): 32-bit signed integer attribute
  • padding_top (::mlir::IntegerAttr): 32-bit signed integer attribute
  • padding_bottom (::mlir::IntegerAttr): 32-bit signed integer attribute
  • flattened_compat_info (::mlir::tt::ttir::FlattenedCompatInfoAttr): Information for sliding window operations with tensors flattened to (1, 1, N*H*W, C). This attribute marks operations that are compatible with flattened tensors; it is used as a marker and doesn't carry any additional data.

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.maximum (tt::ttir::MaximumOp)

Elementwise maximum operation.

The maximum operation calculates the elementwise maximum between two tensors.

For each pair of corresponding elements, it selects the larger value and places it in the output tensor. This operation has the idempotence property, meaning that applying it twice with the same second operand returns the original result: maximum(maximum(x, y), y) = maximum(x, y).

Example:

// Maximum operation
%result = ttir.maximum(%lhs, %rhs, %output) : tensor<2x3xi32>, tensor<2x3xi32>, tensor<2x3xi32> -> tensor<2x3xi32>
// Input tensors:
// %lhs: [[3, 2, 7], [1, 4, 4]]
// %rhs: [[1, 4, 2], [1, 2, 3]]
// Output tensor:
// [[3, 4, 7], [1, 4, 4]]

Note: When comparing with NaN values, NaN is typically not selected as the maximum value.

Mathematical definition: maximum(x, y) = max(x, y)

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.mean (tt::ttir::MeanOp)

Mean reduction op.

The mean operation computes the arithmetic mean of elements along specified dimensions of the input tensor.

This operation reduces the input tensor by computing the average of all elements along the dimensions specified in dim_arg. If dim_arg is not provided, the mean is computed over all dimensions, resulting in a scalar value. If keep_dim is set to true, the reduced dimensions are retained with a size of 1.

Example:

// Mean along dimension 1
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<2xf32>
%result = ttir.mean(%input, %output) {keep_dim = false, dim_arg = [1: i32]} : tensor<2x3xf32>, tensor<2xf32> -> tensor<2xf32>
// Input tensor:
// [[1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0]]
// Output tensor:
// [2.0, 5.0]  // Mean of each row

// Mean along dimension 0
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<3xf32>
%result = ttir.mean(%input, %output) {keep_dim = false, dim_arg = [0: i32]} : tensor<2x3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensor:
// [[1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0]]
// Output tensor:
// [2.5, 3.5, 4.5]  // Mean of each column

// Mean over all dimensions
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<f32>
%result = ttir.mean(%input, %output) {keep_dim = false} : tensor<2x3xf32>, tensor<f32> -> tensor<f32>
// Input tensor:
// [[1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0]]
// Output tensor:
// 3.5  // Mean of all elements

Note: For integer input tensors, the result is typically rounded to the nearest integer according to the rounding mode.

Mathematical definition: mean(x, dim) = (∑ x[i]) / n for all i in dimension dim, where n is the number of elements in dimension dim

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • keep_dim (Bool): Whether to keep the reduced dimensions or not.
  • dim_arg (Array of Int32): Dimensions to reduce along.

Outputs:

  • output (Tensor): The result tensor after applying the reduction.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • keep_dim (::mlir::BoolAttr): bool attribute
  • dim_arg (::mlir::ArrayAttr): 32-bit integer array attribute

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.mesh_shard (tt::ttir::MeshShardOp)

Mesh shard operation.

MeshShard op shards the inputs (FullToShard) or concatenates the outputs (ShardToFull) for CCL ops.

shard_direction attribute determines whether to shard or concat.

shard_type attribute determines how to shard or concat:

  • manual: no sharding
  • replicate: all devices have identical data
  • maximal: only one device contains the full data
  • devices: shard_shape/shard_dims determine the particular sharding

shard_dims attribute determines the row and column sharding dimensions of the input tensor.

For example, on 2x4 mesh hardware, the following op shards arg0 into 8 slices, with the row dimension divided by 2 and the column dimension divided by 4.

%1 = "ttir.mesh_shard"(%arg0, %0) < {... shard_direction = #tt.shard_direction<full_to_shard>, shard_shape = array<i64: 2, 4>, shard_dims = array<i64: 0, 1>, shard_type = #tt.shard_type}> : (tensor<8192x784xf32>, ...) -> tensor<4096x196xf32>

On the other hand, the following op concatenates %4 into a single tensor by concatenating one of the top-row tensors with one of the bottom-row tensors.

%6 = "ttir.mesh_shard"(%4, %5) < {..., shard_direction = #tt.shard_direction<shard_to_full>, shard_shape = array<i64: 2, 1>, shard_dims = arrray<i64: 1, -1>, shard_type = #tt.shard_type}> : (tensor<4096x16384xf32>, ...) -> tensor<8192x16384xf32>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • shard_type (::mlir::tt::MeshShardTypeAttr): MeshShard shard_type attribute in the TT dialect. Defines the sharded tensor data of the mesh_shard op:
      - Identity: input and output tensors are pre-sharded (same data); no sharding is required.
      - Replicate: all devices have the full tensor (same data).
      - Maximal: one device or a subset of devices has the full tensor (same data).
      - Devices: all or some of the devices hold sharded (partial) tensors (different data).
  • shard_direction (::mlir::tt::MeshShardDirectionAttr): TT MeshShardDirection
  • shard_shape (::mlir::DenseI64ArrayAttr): i64 dense array attribute
  • shard_dims (::mlir::DenseI64ArrayAttr): i64 dense array attribute

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.min (tt::ttir::MinOp)

Minimum reduction operation.

The min operation computes the minimum value of elements along specified dimensions of the input tensor.

This operation reduces the input tensor by finding the minimum value of all elements along the dimensions specified in dim_arg. If dim_arg is not provided, the minimum is computed over all dimensions, resulting in a scalar value. If keep_dim is set to true, the reduced dimensions are retained with a size of 1.

Example:

// Minimum along dimension 1
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<2xf32>
%result = ttir.min(%input, %output) {keep_dim = false, dim_arg = [1: i32]} : tensor<2x3xf32>, tensor<2xf32> -> tensor<2xf32>
// Input tensor:
// [[1.0, 5.0, 3.0],
//  [4.0, 2.0, 6.0]]
// Output tensor:
// [1.0, 2.0]  // Minimum of each row

// Minimum along dimension 0
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<3xf32>
%result = ttir.min(%input, %output) {keep_dim = false, dim_arg = [0: i32]} : tensor<2x3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensor:
// [[1.0, 5.0, 3.0],
//  [4.0, 2.0, 6.0]]
// Output tensor:
// [1.0, 2.0, 3.0]  // Minimum of each column

// Minimum over all dimensions
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<f32>
%result = ttir.min(%input, %output) {keep_dim = false} : tensor<2x3xf32>, tensor<f32> -> tensor<f32>
// Input tensor:
// [[1.0, 5.0, 3.0],
//  [4.0, 2.0, 6.0]]
// Output tensor:
// 1.0  // Minimum of all elements

Note: When comparing with NaN values, NaN is typically not selected as the minimum value.

Mathematical definition: min(x, dim) = min(x[i]) for all i in dimension dim

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • keep_dim (Bool): Whether to keep the reduced dimensions or not.
  • dim_arg (Array of Int32): Dimensions to reduce along.

Outputs:

  • output (Tensor): The result tensor after applying the reduction.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • keep_dim (::mlir::BoolAttr): bool attribute
  • dim_arg (::mlir::ArrayAttr): 32-bit integer array attribute

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.minimum (tt::ttir::MinimumOp)

Elementwise minimum operation.

The minimum operation computes the elementwise minimum between two tensors.

For each pair of corresponding elements, it selects the smaller value and places it in the output tensor. This operation has the idempotence property, meaning that applying it twice with the same second operand returns the original result: minimum(minimum(x, y), y) = minimum(x, y).

Example:

// Minimum operation
%result = ttir.minimum(%lhs, %rhs, %output) : tensor<2x3xi32>, tensor<2x3xi32>, tensor<2x3xi32> -> tensor<2x3xi32>
// Input tensors:
// %lhs: [[3, 2, 7], [1, 4, 4]]
// %rhs: [[1, 4, 2], [1, 2, 3]]
// Output tensor:
// [[1, 2, 2], [1, 2, 3]]

// Example with floating point values
%result = ttir.minimum(%float_lhs, %float_rhs, %float_output) : tensor<3xf32>, tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensors:
// %float_lhs: [3.5, -2.1, 0.0]
// %float_rhs: [1.2, -5.0, 0.0]
// Output tensor:
// [1.2, -5.0, 0.0]

Note: When comparing with NaN values, NaN is typically not selected as the minimum value.

Mathematical definition: minimum(x, y) = min(x, y)

Traits: AlwaysSpeculatableImplTrait, TTIR_BinaryIdempotence, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary, TTIR_PartiallyBroadcastable

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.multiply (tt::ttir::MultiplyOp)

Elementwise multiplication operation.

The multiply operation performs an elementwise multiplication between two tensors.

For each pair of corresponding elements, it multiplies the elements and places the result in the output tensor.

Example:

// Multiplication operation
%result = ttir.multiply(%lhs, %rhs, %output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi32> -> tensor<3xi32>
// Input tensors:
// %lhs: [10, 20, 30]
// %rhs: [1, 2, 3]
// Output tensor:
// [10, 40, 90]

// Example with floating point values
%result = ttir.multiply(%float_lhs, %float_rhs, %float_output) : tensor<3xf32>, tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensors:
// %float_lhs: [3.5, 0.0, -1.2]
// %float_rhs: [1.5, 2.0, -3.2]
// Output tensor:
// [5.25, 0.0, 3.84]

Note: The data type of the output tensor matches the data type of the input tensors.

Mathematical definition: multiply(x, y) = x * y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary, TTIR_PartiallyBroadcastable

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.neg (tt::ttir::NegOp)

Elementwise negate operation.

The neg operation negates each element in the input tensor.

For each element, it returns the negation of the value. The operation preserves the data type of the input.

Example:

// Compute negation of all elements in %input
%result = ttir.neg(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[-1.7, -2.0, 0.3, -4.5], ... ]

Mathematical definition: neg(x) = -x

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TTIR_Involution, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

  • input: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.ne (tt::ttir::NotEqualOp)

Elementwise inequality comparison operation.

The ne operation performs an elementwise inequality comparison between two tensors.

For each pair of corresponding elements, it returns:

  • 1 (true) if the elements are not equal
  • 0 (false) if the elements are equal

Note that special handling may be required for floating-point NaN values, as NaN is not equal to any value, including itself. This means ne(NaN, NaN) should return true.

Example:

// Compare elements for inequality
%result = ttir.ne(%lhs, %rhs, %output) : tensor<4x4xf32>, tensor<4x4xf32>, tensor<4x4xi1> -> tensor<4x4xi1>
// Input tensors:
// %lhs: [[1.0, 2.0, 3.0, 2.0], ... ]
// %rhs: [[1.0, 2.0, 4.0, 5.0], ... ]
// Output tensor:
// [[0, 0, 1, 1], ... ]  // 0 where equal, 1 where not equal

// Example with integer tensors
%result = ttir.ne(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi1> -> tensor<3xi1>
// Input tensors:
// %int_lhs: [10, -5, 0]
// %int_rhs: [10, 5, 1]
// Output tensor:
// [0, 1, 1]  // Only the first element is equal, so its result is 0

Mathematical definition: not_equal(x, y) = x != y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values
  • output: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttir.ones (tt::ttir::OnesOp)

Creates a tensor filled with ones.

The ones operation creates a tensor filled with ones of the specified shape.

This operation is commonly used to initialize tensors with one values. It takes a shape attribute and produces a tensor of that shape with all elements set to one.

Example:

// Create a 3D tensor of ones with shape [64, 28, 28]
%result = ttir.ones() {
    shape = [64, 28, 28]
} : () -> tensor<64x28x28xbf16>
// Result: A tensor of shape [64, 28, 28] filled with ones

// Create a 2D tensor of ones with shape [3, 4]
%result = ttir.ones() {
    shape = [3, 4]
} : () -> tensor<3x4xf32>
// Result: [[1.0, 1.0, 1.0, 1.0],
//          [1.0, 1.0, 1.0, 1.0],
//          [1.0, 1.0, 1.0, 1.0]]

Attributes:

  • shape (Array of Integer): The shape of the tensor to create.

Outputs:

  • result (Tensor): The tensor filled with ones.

Note: The element type of the result tensor is determined by the return type specified in the operation. This operation is useful for initializing tensors before scaling them or as a starting point for operations that require tensors filled with ones, such as creating masks or constant multipliers.

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
shape | ::mlir::DenseI32ArrayAttr | i32 dense array attribute

Results:

Result | Description
result | ranked tensor of any type values

ttir.pad (tt::ttir::PadOp)

Tensor padding operation.

The pad operation adds padding to the edges of an input tensor with a specified constant value.

This operation extends the dimensions of the input tensor by adding padding elements with a constant value. The padding is specified for each dimension as the number of elements to add at the beginning (low) and end (high) of that dimension.

The padding attribute must be a sequence of integers whose length is twice the rank of the input. Each pair of integers specifies the amount of padding to add at the low and high ends of the corresponding dimension. For example, for a 2D tensor, the padding attribute has 4 values: [dim0_low, dim0_high, dim1_low, dim1_high].

Example:

// Pad a 2x3 tensor with different padding on each dimension
%input = ... : tensor<2x3xf32>  // Input tensor with values:
                                // [[1, 2, 3],
                                //  [4, 5, 6]]
%output = ttir.empty() : tensor<3x5xf32>  // Output tensor shape
%result = ttir.pad(%input, %output) {
    padding = [1, 0, 1, 1],  // Format: [dim0_low, dim0_high, dim1_low, dim1_high]
    value = 0.0 : f32
} : tensor<2x3xf32>, tensor<3x5xf32> -> tensor<3x5xf32>
// Result:
// [[0, 0, 0, 0, 0],
//  [0, 1, 2, 3, 0],
//  [0, 4, 5, 6, 0]]

Inputs:

  • input (Tensor): The input tensor to pad.

Attributes:

  • padding (Array of Integer): The padding values for each dimension, specified as [dim0_low, dim0_high, dim1_low, dim1_high, ...].
  • value (Float): The constant value to use for the padding elements.

Outputs:

  • result (Tensor): The padded tensor.

Note: The shape of the output tensor must match the shape of the input tensor plus the padding specified in the padding attribute. For example, if the input shape is [2,3] and the padding is [1,0,1,1], then the output shape must be [3,5].

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
padding | ::mlir::DenseI32ArrayAttr | i32 dense array attribute
value | ::mlir::FloatAttr | 32-bit float attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.permute (tt::ttir::PermuteOp)

Tensor dimension permutation operation.

The permute operation reorders the dimensions of the input tensor according to the specified permutation.

This operation is similar to transpose but generalizes to tensors of any rank. It rearranges the dimensions of the input tensor based on the permutation attribute, which specifies the new order of dimensions.

Example:

// Transpose a 2D tensor (swap dimensions 0 and 1)
%input = ... : tensor<3x4xf32>  // Input tensor with shape [3,4]
%output = ttir.empty() : tensor<4x3xf32>  // Output tensor shape
%result = ttir.permute(%input, %output) {
    permutation = [1, 0]  // Swap dimensions 0 and 1
} : tensor<3x4xf32>, tensor<4x3xf32> -> tensor<4x3xf32>
// Result: tensor with shape [4,3], equivalent to transposing the input

// Permute a 3D tensor
%input = ... : tensor<2x3x4xf32>  // Input tensor with shape [2,3,4]
%output = ttir.empty() : tensor<3x4x2xf32>  // Output tensor shape
%result = ttir.permute(%input, %output) {
    permutation = [1, 2, 0]  // Reorder dimensions to [1,2,0]
} : tensor<2x3x4xf32>, tensor<3x4x2xf32> -> tensor<3x4x2xf32>
// Result: tensor with shape [3,4,2]

Inputs:

  • input (Tensor): The input tensor to permute.

Attributes:

  • permutation (Array of Integer): The permutation of the input tensor dimensions. This must be a valid permutation of the indices [0, 1, ..., rank-1].

Outputs:

  • result (Tensor): The permuted tensor.

Note: The permutation attribute must contain exactly one occurrence of each integer in the range [0, rank-1], where rank is the number of dimensions in the input tensor. The shape of the output tensor is determined by permuting the dimensions of the input tensor according to the permutation. For example, if the input shape is [2,3,4] and the permutation is [1,2,0], then the output shape will be [3,4,2].

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_TensorManipulation

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
permutation | ::mlir::DenseI64ArrayAttr | i64 dense array attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.pooling (tt::ttir::PoolingOp)

General pooling operation.

The pooling operation is a generalized pooling operation that can implement various pooling methods such as max pooling, average pooling, and sum pooling.

Pooling operations are commonly used in neural networks to reduce the spatial dimensions of feature maps by applying a specific function (like maximum or average) over local regions of the input tensor.

Example:

// Max pooling with 2x2 window and stride 2
%input = ... : tensor<1x32x32x16xf32>    // Batch size 1, 32x32 feature map, 16 channels
%output = ttir.empty() : tensor<1x16x16x16xf32>  // Output tensor
%result = ttir.pooling(%input, %output) {
    pooling_method = "MAX",
    window_dimensions = [1, 2, 2, 1],
    window_strides = [1, 2, 2, 1],
    base_dilations = [1, 1, 1, 1],
    window_dilations = [1, 1, 1, 1],
    padding = [0, 0, 0, 0, 0, 0, 0, 0]
} : tensor<1x32x32x16xf32>, tensor<1x16x16x16xf32> -> tensor<1x16x16x16xf32>

// Average pooling with 3x3 window and stride 2
%input = ... : tensor<1x32x32x16xf32>    // Batch size 1, 32x32 feature map, 16 channels
%output = ttir.empty() : tensor<1x15x15x16xf32>  // Output tensor
%result = ttir.pooling(%input, %output) {
    pooling_method = "AVG",
    window_dimensions = [1, 3, 3, 1],
    window_strides = [1, 2, 2, 1],
    base_dilations = [1, 1, 1, 1],
    window_dilations = [1, 1, 1, 1],
    padding = [0, 0, 0, 0, 0, 0, 0, 0]
} : tensor<1x32x32x16xf32>, tensor<1x15x15x16xf32> -> tensor<1x15x15x16xf32>

Inputs:

  • inputs (Variadic Tensor): Input tensors to be pooled.

Attributes:

  • pooling_method (Enum): The pooling method to use (MAX, AVG, SUM).
  • window_dimensions (Array of Integer): Dimensions of the pooling window.
  • window_strides (Array of Integer): Stride of the pooling window.
  • base_dilations (Array of Integer): Dilation factors for the input.
  • window_dilations (Array of Integer): Dilation factors for the pooling window.
  • padding (Array of Integer): Padding to apply to the input.

Outputs:

  • results (Variadic Tensor): Output tensors after pooling.

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
pooling_method | ::mlir::tt::ttir::PoolingMethodAttr | TTIR PoolingMethod
window_dimensions | ::mlir::DenseI64ArrayAttr | i64 dense array attribute
window_strides | ::mlir::DenseI64ArrayAttr | i64 dense array attribute
base_dilations | ::mlir::DenseI64ArrayAttr | i64 dense array attribute
window_dilations | ::mlir::DenseI64ArrayAttr | i64 dense array attribute
padding | ::mlir::DenseI64ArrayAttr | i64 dense array attribute

Operands:

Operand | Description
inputs | variadic of ranked tensor of any type values
outputs | variadic of ranked tensor of any type values

Results:

Result | Description
«unnamed» | variadic of ranked tensor of any type values

ttir.pow (tt::ttir::PowOp)

Elementwise power operation.

The pow operation performs an elementwise exponentiation between two tensors.

For each pair of corresponding elements, it raises the element in the first tensor (base) to the power of the element in the second tensor (exponent) and places the result in the output tensor.

Example:

// Power operation
%result = ttir.pow(%lhs, %rhs, %output) : tensor<3xf32>, tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensors:
// %lhs: [2.0, 3.0, 4.0]  // Bases
// %rhs: [2.0, 2.0, 0.5]  // Exponents
// Output tensor:
// [4.0, 9.0, 2.0]

// Example with integer values
%result = ttir.pow(%int_lhs, %int_rhs, %int_output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi32> -> tensor<3xi32>
// Input tensors:
// %int_lhs: [2, 3, 5]
// %int_rhs: [3, 2, 1]
// Output tensor:
// [8, 9, 5]

Special cases:

  • 0^0 is typically defined as 1
  • For floating-point types, a negative base with a non-integer exponent would mathematically yield a complex number; complex results are typically not supported, so the behavior is undefined (commonly NaN)

Mathematical definition: pow(x, y) = x^y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
lhs | ranked tensor of any type values
rhs | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.prod (tt::ttir::ProdOp)

Product reduction op.

The `prod` operation computes the product of elements along specified dimensions of the input tensor.

This operation reduces the input tensor by multiplying all elements along the dimensions specified in dim_arg. If dim_arg is not provided, the product is computed over all dimensions, resulting in a scalar value. If keep_dim is set to true, the reduced dimensions are retained with a size of 1.

Example:

// Product along dimension 0
%input = ... : tensor<2x3xi32>
%output = ttir.empty() : tensor<3xi32>
%result = ttir.prod(%input, %output) {keep_dim = false, dim_arg = [0: i32]} : tensor<2x3xi32>, tensor<3xi32> -> tensor<3xi32>
// Input tensor:
// [[1, 2, 3],
//  [4, 5, 6]]
// Output tensor:
// [4, 10, 18]  // Product of each column

// Product along dimension 1
%input = ... : tensor<2x3xi32>
%output = ttir.empty() : tensor<2xi32>
%result = ttir.prod(%input, %output) {keep_dim = false, dim_arg = [1: i32]} : tensor<2x3xi32>, tensor<2xi32> -> tensor<2xi32>
// Input tensor:
// [[1, 2, 3],
//  [4, 5, 6]]
// Output tensor:
// [6, 120]  // Product of each row

// Product over all dimensions
%input = ... : tensor<2x3xi32>
%output = ttir.empty() : tensor<i32>
%result = ttir.prod(%input, %output) {keep_dim = false} : tensor<2x3xi32>, tensor<i32> -> tensor<i32>
// Input tensor:
// [[1, 2, 3],
//  [4, 5, 6]]
// Output tensor:
// 720  // Product of all elements

Note: For floating-point inputs, the order of multiplication may affect the result due to floating-point precision issues.

Mathematical definition: prod(x, dim) = ∏ x[i] for all i in dimension dim

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • keep_dim (Bool): Whether to keep the reduced dimensions or not.
  • dim_arg (Array of Int32): Dimensions to reduce along.

Outputs:

  • output (Tensor): The result tensor after applying the reduction.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
keep_dim | ::mlir::BoolAttr | bool attribute
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.quantize (tt::ttir::QuantizeOp)

Quantize operation.

The Quantize operation converts a tensor into a quantized tensor using the quant.uniform type from the MLIR Quant dialect. This type encapsulates the scale and zero-point metadata directly within the tensor type. The output tensor will be of type 'quant.uniform', where each element is computed as:

output[i] = round(input[i] / scale) + zero_point

Example:

%input = ttir.empty() : () -> tensor<64x128xf32>
%output = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>
%quantized = "ttir.quantize"(%input, %output) : (tensor<64x128xf32>, tensor<64x128x!quant.uniform<i32:f32, 0.1>>) -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>

// In this example:
// - The input is a 64x128 tensor of 32-bit floating-point values
// - The output is a 64x128 tensor of 32-bit quantized values
// - The scale is 0.1 (each step represents 0.1 in the original scale)
// - No zero point is given in the type, so it defaults to 0 (the value 0 in the quantized space represents 0.0 in the original space)
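
Applying the definition above to a single element: with scale = 0.1 and zero point 0, an input value of 1.7 is stored as round(1.7 / 0.1) + 0 = 17 in the quantized space.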

Inputs:

  • input (Tensor): Input tensor to be quantized.

Results:

  • result (Quantized Tensor): The quantized tensor with type quant.uniform.

Note: The quantization parameters (scale and zero point) are specified in the result type. Quantization helps reduce model size and computational requirements by representing floating-point values with lower-precision integers, which is particularly useful for deployment on resource-constrained devices.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.quantize_unrolled (tt::ttir::QuantizeUnrolledOp)

Quantize operation unrolled (scale and zero point as input operands).

The QuantizeUnrolledOp quantizes a tensor using the scale and zero point provided as input operands.

Inputs:

  • input AnyRankedTensor: The input tensor to be quantized. Must have floating-point element type.
  • scale AnyRankedTensor: The scale factor (or factors for per-axis quantization). Must be either a scalar (for per-tensor quantization) or a 1D tensor with size matching the dimension of the specified axis (for per-axis quantization).
  • zero_point AnyRankedTensor: The zero point value (or values for per-axis quantization). Must be in range of the quantized storage type.
  • axis Optional: The axis along which quantization is applied. Must be in range [0, rank) where rank is the rank of the input tensor.
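
Example (an illustrative sketch only; no example is given in the original entry, and the generic-form syntax, tensor shapes, and scalar encoding of the per-tensor scale and zero point shown here are assumptions modeled on the ttir.quantize example above, not verified IR):

%input = ttir.empty() : () -> tensor<64x128xf32>
%scale = ttir.empty() : () -> tensor<f32>         // per-tensor scale, e.g. 0.1 (assumed encoding)
%zero_point = ttir.empty() : () -> tensor<i32>    // per-tensor zero point, e.g. 0 (assumed encoding)
%output = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>
%quantized = "ttir.quantize_unrolled"(%input, %scale, %zero_point, %output) : (tensor<64x128xf32>, tensor<f32>, tensor<i32>, tensor<64x128x!quant.uniform<i32:f32, 0.1>>) -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>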

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
axis | ::mlir::IntegerAttr | 32-bit signless integer attribute

Operands:

Operand | Description
input | ranked tensor of any type values
scale | ranked tensor of any type values
zero_point | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.reciprocal (tt::ttir::ReciprocalOp)

Eltwise reciprocal.

The reciprocal operation computes the reciprocal (1/x) of each element in the input tensor.

For each element, it returns the reciprocal of the value.

Example:

// Compute reciprocal of all elements in %input
%result = ttir.reciprocal(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[0.5882, 0.5, -3.3333, 0.2222], ... ]

Mathematical definition: reciprocal(x) = 1 / x

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TTIR_Involution, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.reduce_and (tt::ttir::ReduceAndOp)

Logical AND reduction operation.

The reduce_and operation performs a logical AND reduction along specified dimensions of the input tensor.

This operation reduces the input tensor by applying a logical AND operation to all elements along the dimensions specified in dim_arg. If dim_arg is not provided, the reduction is computed over all dimensions, resulting in a scalar value. If keep_dim is set to true, the reduced dimensions are retained with a size of 1.

The operation treats non-zero values as True and zero values as False when performing the logical AND.

Example:

// Logical AND reduction along dimension 0
%input = ... : tensor<4x4xi1>
%output = ttir.empty() : tensor<4xi1>
%result = ttir.reduce_and(%input, %output) {keep_dim = false, dim_arg = [0: i32]} : tensor<4x4xi1>, tensor<4xi1> -> tensor<4xi1>
// Input tensor (where 1 represents True and 0 represents False):
// [[1, 0, 1, 0],
//  [1, 1, 1, 1],
//  [0, 0, 1, 1],
//  [0, 1, 1, 0]]
// Output tensor:
// [0, 0, 1, 0]  // Logical AND of each column

// Logical AND reduction along dimension 1
%input = ... : tensor<4x4xi1>
%output = ttir.empty() : tensor<4xi1>
%result = ttir.reduce_and(%input, %output) {keep_dim = false, dim_arg = [1: i32]} : tensor<4x4xi1>, tensor<4xi1> -> tensor<4xi1>
// Input tensor:
// [[1, 0, 1, 0],
//  [1, 1, 1, 1],
//  [0, 0, 1, 1],
//  [0, 1, 1, 0]]
// Output tensor:
// [0, 1, 0, 0]  // Logical AND of each row

// Logical AND reduction over all dimensions
%input = ... : tensor<4x4xi1>
%output = ttir.empty() : tensor<i1>
%result = ttir.reduce_and(%input, %output) {keep_dim = false} : tensor<4x4xi1>, tensor<i1> -> tensor<i1>
// Input tensor:
// [[1, 0, 1, 0],
//  [1, 1, 1, 1],
//  [0, 0, 1, 1],
//  [0, 1, 1, 0]]
// Output tensor:
// 0  // Logical AND of all elements

Mathematical definition: reduce_and(x, dim) = AND(x[i]) for all i in dimension dim

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • keep_dim (Bool): Whether to keep the reduced dimensions or not.
  • dim_arg (Array of Int32): Dimensions to reduce along.

Outputs:

  • output (Tensor): The result tensor after applying the reduction.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
keep_dim | ::mlir::BoolAttr | bool attribute
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.reduce_or (tt::ttir::ReduceOrOp)

Logical OR reduction operation.

The reduce_or operation performs a logical OR reduction along specified dimensions of the input tensor.

This operation reduces the input tensor by applying a logical OR operation to all elements along the dimensions specified in dim_arg. If dim_arg is not provided, the reduction is computed over all dimensions, resulting in a scalar value. If keep_dim is set to true, the reduced dimensions are retained with a size of 1.

The operation treats non-zero values as True and zero values as False when performing the logical OR.

Example:

// Logical OR reduction along dimension 0
%input = ... : tensor<4x4xi1>
%output = ttir.empty() : tensor<4xi1>
%result = ttir.reduce_or(%input, %output) {keep_dim = false, dim_arg = [0: i32]} : tensor<4x4xi1>, tensor<4xi1> -> tensor<4xi1>
// Input tensor (where 1 represents True and 0 represents False):
// [[1, 0, 0, 0],
//  [1, 1, 0, 1],
//  [0, 0, 0, 1],
//  [0, 0, 0, 0]]
// Output tensor:
// [1, 1, 0, 1]  // Logical OR of each column

// Logical OR reduction along dimension 1
%input = ... : tensor<4x4xi1>
%output = ttir.empty() : tensor<4xi1>
%result = ttir.reduce_or(%input, %output) {keep_dim = false, dim_arg = [1: i32]} : tensor<4x4xi1>, tensor<4xi1> -> tensor<4xi1>
// Input tensor:
// [[1, 0, 0, 0],
//  [1, 1, 0, 1],
//  [0, 0, 0, 1],
//  [0, 0, 0, 0]]
// Output tensor:
// [1, 1, 1, 0]  // Logical OR of each row

// Logical OR reduction over all dimensions
%input = ... : tensor<4x4xi1>
%output = ttir.empty() : tensor<i1>
%result = ttir.reduce_or(%input, %output) {keep_dim = false} : tensor<4x4xi1>, tensor<i1> -> tensor<i1>
// Input tensor:
// [[1, 0, 0, 0],
//  [1, 1, 0, 1],
//  [0, 0, 0, 1],
//  [0, 0, 0, 0]]
// Output tensor:
// 1  // Logical OR of all elements

Mathematical definition: reduce_or(x, dim) = OR(x[i]) for all i in dimension dim

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • keep_dim (Bool): Whether to keep the reduced dimensions or not.
  • dim_arg (Array of Int32): Dimensions to reduce along.

Outputs:

  • output (Tensor): The result tensor after applying the reduction.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
keep_dim | ::mlir::BoolAttr | bool attribute
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.reduce_scatter (tt::ttir::ReduceScatterOp)

Reduce scatter operation.

The reduce_scatter operation reduces tensors across devices along the specified cluster axis, applying the reduction given by reduce_type, and scatters the result along scatter_dim so that each participating device holds one shard of the reduced tensor.
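
Example (an illustrative sketch only; the original entry gives no example, and the attribute literals, reduce_type syntax, and tensor shapes here are assumptions rather than verified IR):

// Sum-reduce across cluster axis 1 and scatter the result along dimension 3
%output = ttir.empty() : tensor<1x1x8192x256xf32>
%result = "ttir.reduce_scatter"(%input, %output) {reduce_type = #tt.reduce_type<sum>, scatter_dim = 3 : si32, cluster_axis = 1 : ui32} : (tensor<1x1x8192x512xf32>, tensor<1x1x8192x256xf32>) -> tensor<1x1x8192x256xf32>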

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
reduce_type | ::mlir::tt::ReduceTypeAttr | TT Reduce Type
scatter_dim | ::mlir::IntegerAttr | 32-bit signed integer attribute
cluster_axis | ::mlir::IntegerAttr | 32-bit unsigned integer attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.relu (tt::ttir::ReluOp)

Eltwise ReLU.

The relu operation computes the rectified linear unit (ReLU) of each element in the input tensor.

For each element, it returns the maximum of 0 and the value. The operation preserves the data type of the input.

Example:

// Compute ReLU of all elements in %input
%result = ttir.relu(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[1.7, 2.0, 0.0, 4.5], ... ]

Mathematical definition: relu(x) = max(0, x)

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TTIR_Idempotence, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.remainder (tt::ttir::RemainderOp)

Elementwise remainder operation.

The remainder operation performs an elementwise remainder (modulo) operation between two tensors.

For each pair of corresponding elements, it computes the remainder when dividing the element in the first tensor (dividend) by the element in the second tensor (divisor) and places the result in the output tensor.

Example:

// Remainder operation
%result = ttir.remainder(%lhs, %rhs, %output) : tensor<4xi64>, tensor<4xi64>, tensor<4xi64> -> tensor<4xi64>
// Input tensors:
// %lhs: [17, -17, 17, -17]  // Dividends
// %rhs: [3, 3, -3, -3]      // Divisors
// Output tensor:
// [2, -2, 2, -2]

// Example with floating point values
%result = ttir.remainder(%float_lhs, %float_rhs, %float_output) : tensor<3xf32>, tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensors:
// %float_lhs: [10.5, -10.5, 3.0]
// %float_rhs: [3.0, 3.0, 2.0]
// Output tensor:
// [1.5, -1.5, 1.0]

Note: Division by zero typically results in undefined behavior or NaN for floating-point types.

Mathematical definition: remainder(x, y) = x % y (where % is the remainder operator)

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
lhs | ranked tensor of any type values
rhs | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.repeat_interleave (tt::ttir::RepeatInterleaveOp)

Tensor repeat interleave operation.

The repeat_interleave operation repeats elements of a tensor along a specified dimension.

Unlike the repeat operation which repeats the entire tensor, this operation repeats each individual element of the input tensor the specified number of times along the given dimension. This creates an interleaved pattern of repeated values.

Example:

// Repeat interleave along dimension 0 with repeats=2
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<4x3xf32>
%result = ttir.repeat_interleave(%input, %output) {repeats = 2 : ui32, dim = 0 : i32} :
    tensor<2x3xf32>, tensor<4x3xf32> -> tensor<4x3xf32>
// Input tensor:
// [[1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0]]
// Output tensor:
// [[1.0, 2.0, 3.0],  // First row repeated
//  [1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0],  // Second row repeated
//  [4.0, 5.0, 6.0]]

// Repeat interleave along dimension 1 with repeats=3
%input = ... : tensor<2x2xf32>
%output = ttir.empty() : tensor<2x6xf32>
%result = ttir.repeat_interleave(%input, %output) {repeats = 3 : ui32, dim = 1 : i32} :
    tensor<2x2xf32>, tensor<2x6xf32> -> tensor<2x6xf32>
// Input tensor:
// [[1.0, 2.0],
//  [3.0, 4.0]]
// Output tensor:
// [[1.0, 1.0, 1.0, 2.0, 2.0, 2.0],  // Each element repeated 3 times
//  [3.0, 3.0, 3.0, 4.0, 4.0, 4.0]]

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • repeats (Integer): The number of times to repeat each element.
  • dim (Integer): The dimension along which to repeat elements.

Outputs:

  • result (Tensor): The tensor with repeated elements.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
repeats | ::mlir::IntegerAttr | 32-bit unsigned integer attribute
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.repeat (tt::ttir::RepeatOp)

Repeat operation.

The repeat operation creates a new tensor by replicating the input tensor's elements along specified dimensions.

This operation repeats the entire input tensor along each dimension according to the values specified in the repeat_dimensions attribute. The resulting tensor's shape is the product of the input tensor's shape and the corresponding repeat values.

Example:

// Repeat a 2x3 tensor with repeat dimensions [2, 2]
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<4x6xf32>
%result = ttir.repeat(%input, %output) {repeat_dimensions = [2, 2]} :
    tensor<2x3xf32>, tensor<4x6xf32> -> tensor<4x6xf32>
// Input tensor:
// [[1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0]]
// Output tensor:
// [[1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0, 4.0, 5.0, 6.0],
//  [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0, 4.0, 5.0, 6.0]]

// Repeat a 2x2 tensor with repeat dimensions [1, 3]
%input = ... : tensor<2x2xf32>
%output = ttir.empty() : tensor<2x6xf32>
%result = ttir.repeat(%input, %output) {repeat_dimensions = [1, 3]} :
    tensor<2x2xf32>, tensor<2x6xf32> -> tensor<2x6xf32>
// Input tensor:
// [[1.0, 2.0],
//  [3.0, 4.0]]
// Output tensor:
// [[1.0, 2.0, 1.0, 2.0, 1.0, 2.0],
//  [3.0, 4.0, 3.0, 4.0, 3.0, 4.0]]

Inputs:

  • input (Tensor): The input tensor to repeat.

Attributes:

  • repeat_dimensions (Array of Integer): The number of times to repeat the tensor along each dimension.

Outputs:

  • result (Tensor): The repeated tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
repeat_dimensions | ::mlir::DenseI64ArrayAttr | i64 dense array attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.requantize (tt::ttir::RequantizeOp)

Requantize operation.

The Requantize operation converts a quantized tensor from one scale and zero-point to another, using the quant.uniform type from the MLIR Quant dialect. The input tensor is expected to be of type quant.uniform. The output tensor will also be of type quant.uniform. Each element in the output tensor is computed as:

output[i] = round((input[i] - input_zero_point) * (input_scale / output_scale)) + output_zero_point

Example:

%input = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>
%output = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.2>>
%requantized = "ttir.requantize"(%input, %output) : (tensor<64x128x!quant.uniform<i32:f32, 0.1>>, tensor<64x128x!quant.uniform<i32:f32, 0.2>>) -> tensor<64x128x!quant.uniform<i32:f32, 0.2>>
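
Working through the definition above for one element: moving from scale 0.1 to scale 0.2 with both zero points at 0, a stored value of 24 (representing 2.4) becomes round((24 - 0) * (0.1 / 0.2)) + 0 = 12, which still represents 2.4 at the coarser scale.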

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.requantize_unrolled (tt::ttir::RequantizeUnrolledOp)

Requantize operation unrolled (scale and zero point as input operands).

The RequantizeUnrolledOp requantizes a tensor using the scale and zero point provided as input operands.

Inputs:

  • input AnyRankedTensor: The input tensor to be requantized. Must have quantized element type.
  • in_scale AnyRankedTensor: The input scale factor (or factors for per-axis quantization). Must be either a scalar (for per-tensor quantization) or a 1D tensor with size matching the dimension of the specified axis (for per-axis quantization).
  • in_zero_point AnyRankedTensor: The input zero point value (or values for per-axis quantization). Must be in range of the quantized storage type.
  • out_scale AnyRankedTensor: The output scale factor (or factors for per-axis quantization). Must be either a scalar (for per-tensor quantization) or a 1D tensor with size matching the dimension of the specified axis (for per-axis quantization).
  • out_zero_point AnyRankedTensor: The output zero point value (or values for per-axis quantization). Must be in range of the quantized storage type.
  • axis Optional: The axis along which quantization is applied. Must be in range [0, rank) where rank is the rank of the input tensor.
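
Example (an illustrative sketch only; no example appears in the original entry, and the generic-form syntax, shapes, and scalar scale/zero-point encoding are assumptions modeled on the ttir.requantize example above, not verified IR):

%input = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>
%in_scale = ttir.empty() : () -> tensor<f32>        // input scale, e.g. 0.1 (assumed encoding)
%in_zero_point = ttir.empty() : () -> tensor<i32>   // input zero point, e.g. 0 (assumed encoding)
%out_scale = ttir.empty() : () -> tensor<f32>       // output scale, e.g. 0.2 (assumed encoding)
%out_zero_point = ttir.empty() : () -> tensor<i32>  // output zero point, e.g. 0 (assumed encoding)
%output = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.2>>
%requantized = "ttir.requantize_unrolled"(%input, %in_scale, %in_zero_point, %out_scale, %out_zero_point, %output) : (tensor<64x128x!quant.uniform<i32:f32, 0.1>>, tensor<f32>, tensor<i32>, tensor<f32>, tensor<i32>, tensor<64x128x!quant.uniform<i32:f32, 0.2>>) -> tensor<64x128x!quant.uniform<i32:f32, 0.2>>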

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
axis | ::mlir::IntegerAttr | 32-bit signless integer attribute

Operands:

Operand | Description
input | ranked tensor of any type values
in_scale | ranked tensor of any type values
in_zero_point | ranked tensor of any type values
out_scale | ranked tensor of any type values
out_zero_point | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.reshape (tt::ttir::ReshapeOp)

Tensor reshape operation.

The reshape operation changes the shape of a tensor without changing the data or number of elements.

This operation takes an input tensor and reshapes it to a new shape specified by the shape attribute. The total number of elements in the tensor must remain the same after reshaping. This operation is commonly used in neural networks to change the dimensionality of tensors between layers.

Example:

// Reshape a 2x3 tensor to a 1x6 tensor
%input = ... : tensor<2x3xf32>  // Input tensor with shape [2,3]
%output = ttir.empty() : tensor<1x6xf32>  // Output tensor with shape [1,6]
%result = ttir.reshape(%input, %output) {shape = [1, 6]} :
    tensor<2x3xf32>, tensor<1x6xf32> -> tensor<1x6xf32>

// Reshape a 3D tensor to a 2D tensor
%input = ... : tensor<2x3x4xf32>  // Input tensor with shape [2,3,4]
%output = ttir.empty() : tensor<6x4xf32>  // Output tensor with shape [6,4]
%result = ttir.reshape(%input, %output) {shape = [6, 4]} :
    tensor<2x3x4xf32>, tensor<6x4xf32> -> tensor<6x4xf32>

Inputs:

  • input (Tensor): The input tensor to reshape.

Attributes:

  • shape (Array of Integer): The new shape for the tensor.

Outputs:

  • result (Tensor): The reshaped tensor.

Note: The total number of elements in the input tensor must equal the total number of elements in the output tensor. For example, a tensor of shape [2,3] (6 elements) can be reshaped to [1,6], [6,1], [2,1,3], etc., but not to [2,4] (8 elements).

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_TensorManipulation

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
shape | ::mlir::ArrayAttr | 32-bit integer array attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.reverse (tt::ttir::ReverseOp)

Tensor reversal operation.

The reverse operation reverses the order of elements in the input tensor along the specified dimensions.

This operation flips the elements of a tensor along one or more axes, which is useful for operations like sequence reversal, matrix transposition with reversal, and other tensor manipulations that require changing the order of elements.

Example:

// Reverse a 3x2 tensor along dimension 1 (columns)
%input = ... : tensor<3x2xi32>  // Input tensor with values:
                                // [[1, 2],
                                //  [3, 4],
                                //  [5, 6]]
%output = ttir.empty() : tensor<3x2xi32>  // Output tensor shape
%result = ttir.reverse(%input, %output) {
    dimensions = [1]  // Reverse along columns
} : tensor<3x2xi32>, tensor<3x2xi32> -> tensor<3x2xi32>
// Result:
// [[2, 1],
//  [4, 3],
//  [6, 5]]

// Reverse a 3x2 tensor along both dimensions
%input = ... : tensor<3x2xi64>  // Input tensor with values:
                                // [[1, 2],
                                //  [3, 4],
                                //  [5, 6]]
%output = ttir.empty() : tensor<3x2xi64>  // Output tensor shape
%result = ttir.reverse(%input, %output) {
    dimensions = [0, 1]  // Reverse along both rows and columns
} : tensor<3x2xi64>, tensor<3x2xi64> -> tensor<3x2xi64>
// Result:
// [[6, 5],
//  [4, 3],
//  [2, 1]]

Inputs:

  • input (Tensor): The input tensor to reverse.

Attributes:

  • dimensions (Array of Integer): The dimensions along which to reverse the tensor.

Outputs:

  • result (Tensor): The reversed tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
dimensions | ::mlir::DenseI64ArrayAttr | i64 dense array attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.rsqrt (tt::ttir::RsqrtOp)

Eltwise reciprocal square root.

The rsqrt operation computes the reciprocal square root of each element in the input tensor.

For each element, it returns the reciprocal of the square root of the value.

Example:

// Compute reciprocal square root of all elements in %input
%result = ttir.rsqrt(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[0.7670, 0.7071, NaN, 0.4714], ... ]

Mathematical definition: rsqrt(x) = 1 / sqrt(x)

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.scatter (tt::ttir::ScatterOp)

Scatter operation.

The scatter operation updates slices of an input tensor at indices specified by scatter_indices with values from the update tensor.

This operation is the inverse of the gather operation. It allows for updating specific slices of a tensor at locations determined by indices. The operation is highly configurable through various dimension attributes that control how the indices and updates are interpreted.

Example:

// Basic scatter example: update values at specific indices in a 1D tensor
%input = ... : tensor<8xf32>        // Input tensor with values: [0, 0, 0, 0, 0, 0, 0, 0]
%indices = ... : tensor<3xi32>      // Indices tensor with values: [1, 3, 5]
%update = ... : tensor<3xf32>       // Update tensor with values: [10, 30, 50]
%output = ttir.empty() : tensor<8xf32>  // Output tensor shape
%result = ttir.scatter(%input, %indices, %update, %output) {
    update_window_dims = [],        // No window dimensions in update tensor
    inserted_window_dims = [0],     // Insert window dimension 0
    input_batching_dims = [],       // No batching dimensions in input
    scatter_indices_batching_dims = [], // No batching dimensions in indices
    scatter_dims_to_operand_dims = [0], // Map scatter dimension 0 to operand dimension 0
    index_vector_dim = 0,           // Indices are in dimension 0
    indices_are_sorted = true,      // Indices are sorted
    unique_indices = true           // Indices are unique
} : tensor<8xf32>, tensor<3xi32>, tensor<3xf32>, tensor<8xf32> -> tensor<8xf32>
// Result: [0, 10, 0, 30, 0, 50, 0, 0]

// Scatter to update a 2D tensor
%input = ... : tensor<4x4xf32>      // Input tensor (4x4 matrix of zeros)
%indices = ... : tensor<2x2xi32>    // Indices tensor with values: [[0, 1], [2, 3]]
%update = ... : tensor<2xf32>       // Update tensor with values: [100, 200]
%output = ttir.empty() : tensor<4x4xf32>  // Output tensor shape
%result = ttir.scatter(%input, %indices, %update, %output) {
    update_window_dims = [],
    inserted_window_dims = [0, 1],
    input_batching_dims = [],
    scatter_indices_batching_dims = [0],
    scatter_dims_to_operand_dims = [0, 1],
    index_vector_dim = 1,
    indices_are_sorted = false,
    unique_indices = true
} : tensor<4x4xf32>, tensor<2x2xi32>, tensor<2xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Result: A 4x4 tensor with 100 at position [0,1] and 200 at position [2,3]

Inputs:

  • input (Tensor): The tensor to be updated.
  • scatter_indices (Tensor): Tensor containing the starting indices for slices to update.
  • update (Tensor): Tensor containing values to scatter into the input tensor.

Attributes:

  • update_window_dims (Array of Integer): Dimensions in update that are window dimensions.
  • inserted_window_dims (Array of Integer): Dimensions in the output that are not present in update.
  • input_batching_dims (Array of Integer): Batch dimensions in the input tensor.
  • scatter_indices_batching_dims (Array of Integer): Batch dimensions in the scatter indices tensor.
  • scatter_dims_to_operand_dims (Array of Integer): Maps dimensions in scatter indices to dimensions in operand.
  • index_vector_dim (Integer): The dimension in scatter indices that contains the index vector.
  • indices_are_sorted (Boolean): Whether indices are sorted lexicographically.
  • unique_indices (Boolean): Whether indices are guaranteed to be unique.

Outputs:

  • result (Tensor): The updated tensor.

Note: The semantics of this operation are complex and based on the StableHLO scatter operation. The configuration of the various dimension attributes determines exactly how the scatter indices are interpreted and how the update values are applied to the input tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
update_window_dims | ::mlir::DenseI32ArrayAttr | i32 dense array attribute
inserted_window_dims | ::mlir::DenseI32ArrayAttr | i32 dense array attribute
input_batching_dims | ::mlir::DenseI32ArrayAttr | i32 dense array attribute
scatter_indices_batching_dims | ::mlir::DenseI32ArrayAttr | i32 dense array attribute
scatter_dims_to_operand_dims | ::mlir::DenseI32ArrayAttr | i32 dense array attribute
index_vector_dim | ::mlir::IntegerAttr | 32-bit signless integer attribute
indices_are_sorted | ::mlir::BoolAttr | bool attribute
unique_indices | ::mlir::BoolAttr | bool attribute

Operands:

Operand | Description
input | ranked tensor of any type values
scatter_indices | ranked tensor of any type values
update | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.select (tt::ttir::SelectOp)

Tensor selection operation.

The select operation extracts a sub-tensor (slice) from the input tensor along a specified dimension.

Unlike the more general slice operation, select operates on a single dimension with a specified starting index, length, and optional stride. This is useful for extracting specific segments of a tensor along a particular axis.

Example:

// Select elements 2, 3, 4 from a 1D tensor along dimension 0
%input = ... : tensor<6xf32>  // Input tensor with values: [1, 2, 3, 4, 5, 6]
%output = ttir.empty() : tensor<3xf32>  // Output tensor shape
%result = ttir.select(%input, %output) {
    dim = 0 : i32,     // Dimension to select from
    begin = 2 : i32,   // Start index
    length = 3 : i32,  // Number of elements to select
    stride = 0 : i32   // No stride (consecutive elements)
} : tensor<6xf32>, tensor<3xf32> -> tensor<3xf32>
// Result: [3, 4, 5]

// Select every other row from a 2D tensor
%input = ... : tensor<4x3xf32>  // Input tensor with values:
                                // [[1, 2, 3],
                                //  [4, 5, 6],
                                //  [7, 8, 9],
                                //  [10, 11, 12]]
%output = ttir.empty() : tensor<2x3xf32>  // Output tensor shape
%result = ttir.select(%input, %output) {
    dim = 0 : i32,     // Select along rows
    begin = 0 : i32,   // Start from the first row
    length = 2 : i32,  // Select 2 rows
    stride = 2 : i32   // Select every other row
} : tensor<4x3xf32>, tensor<2x3xf32> -> tensor<2x3xf32>
// Result:
// [[1, 2, 3],
//  [7, 8, 9]]

Inputs:

  • input (Tensor): The input tensor to select from.

Attributes:

  • dim (Integer): The dimension along which to select elements.
  • begin (Integer): The starting index for selection.
  • length (Integer): The number of elements to select.
  • stride (Integer, default=0): The step size for selection. A value of 0 means no stride (consecutive elements).

Outputs:

  • result (Tensor): The selected tensor.

Note: The shape of the output tensor is the same as the input tensor except for the selected dimension, which will have size length. If stride is non-zero, the elements selected will be at indices begin, begin + stride, begin + 2*stride, etc., up to length elements.
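
Worked check against the second example above: with begin = 0, stride = 2, and length = 2, the selected row indices are 0 and 0 + 2 = 2, which is exactly why rows [1, 2, 3] and [7, 8, 9] appear in the result.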

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute
begin | ::mlir::IntegerAttr | 32-bit signed integer attribute
length | ::mlir::IntegerAttr | 32-bit signed integer attribute
stride | ::mlir::IntegerAttr | 32-bit signed integer attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.sigmoid (tt::ttir::SigmoidOp)

Eltwise sigmoid.

The sigmoid operation computes the sigmoid of each element in the input tensor.

For each element, it returns the sigmoid of the value.

Example:

// Compute sigmoid of all elements in %input
%result = ttir.sigmoid(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[0.8455, 0.8808, 0.4256, 0.9890], ... ]

Mathematical definition: sigmoid(x) = 1 / (1 + exp(-x))

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.sign (tt::ttir::SignOp)

Eltwise sign operation.

The sign operation computes the sign of each element in the input tensor.

For each element, it returns:

  • 1 if the value is positive
  • 0 if the value is zero
  • -1 if the value is negative

This operation has the idempotence property, meaning that applying it multiple times produces the same result as applying it once: sign(sign(x)) = sign(x).

Example:

// Compute sign of all elements in %input
%result = ttir.sign(%input, %output) : tensor<2x3xi32>, tensor<2x3xi32> -> tensor<2x3xi32>
// Input tensor:
// [[3, -2, 0],
//  [1, -4, 4]]
// Output tensor:
// [[1, -1, 0],
//  [1, -1, 1]]

// Example with floating-point values
%result = ttir.sign(%float_input, %float_output) : tensor<4xf32>, tensor<4xf32> -> tensor<4xf32>
// Input tensor:
// [5.7, -0.0, 0.001, -3.14]
// Output tensor:
// [1.0, 0.0, 1.0, -1.0]

Mathematical definition: sign(x) = 1 if x > 0; 0 if x = 0; -1 if x < 0

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TTIR_Idempotence, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.sin (tt::ttir::SinOp)

Eltwise sin operation.

The sin operation computes the sine of each element in the input tensor.

For each element, it returns the sine of the angle in radians.

Example:

// Compute sine of all elements in %input
%result = ttir.sin(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[0.9917, 0.9093, -0.2955, -0.9775], ... ]

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.slice (tt::ttir::SliceOp)

Tensor slice operation.

The slice operation extracts a sub-tensor (slice) from the input tensor across one or more dimensions.

This operation selects a subset of elements from the input tensor based on the specified begin, end, and step indices for each dimension. It's similar to Python's slicing notation tensor[begin:end:step] but extended to multiple dimensions.

Example:

// Extract a 2x2 slice from a 4x4 tensor
%input = ... : tensor<4x4xf32>  // Input tensor with values:
                                // [[1,  2,  3,  4],
                                //  [5,  6,  7,  8],
                                //  [9,  10, 11, 12],
                                //  [13, 14, 15, 16]]
%output = ttir.empty() : tensor<2x2xf32>  // Output tensor shape
%result = ttir.slice(%input, %output) {
    begins = [1, 1],  // Start indices for each dimension
    ends = [3, 3],    // End indices for each dimension (exclusive)
    step = [1, 1]     // Step size for each dimension
} : tensor<4x4xf32>, tensor<2x2xf32> -> tensor<2x2xf32>
// Result:
// [[6,  7],
//  [10, 11]]

// Extract elements with a step of 2
%input = ... : tensor<5xf32>  // Input tensor with values: [1, 2, 3, 4, 5]
%output = ttir.empty() : tensor<3xf32>  // Output tensor shape
%result = ttir.slice(%input, %output) {
    begins = [0],  // Start index
    ends = [5],    // End index (exclusive)
    step = [2]     // Step size
} : tensor<5xf32>, tensor<3xf32> -> tensor<3xf32>
// Result: [1, 3, 5]

Inputs:

  • input (Tensor): The input tensor to slice.

Attributes:

  • begins (Array of Integer): The starting indices for the slice in each dimension.
  • ends (Array of Integer): The ending indices (exclusive) for the slice in each dimension.
  • step (Array of Integer): The step sizes for the slice in each dimension.

Outputs:

  • result (Tensor): The sliced tensor.

Note: The shape of the output tensor is determined by the slice parameters. For each dimension i, the output size is calculated as ceil((ends[i] - begins[i]) / step[i]). The begins, ends, and step arrays must have the same length as the rank of the input tensor.
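
Worked check against the examples above: for the 2x2 slice, each dimension gives ceil((3 - 1) / 1) = 2; for the strided 1D example, ceil((5 - 0) / 2) = 3, matching the tensor<3xf32> result.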

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
begins | ::mlir::ArrayAttr | 32-bit integer array attribute
ends | ::mlir::ArrayAttr | 32-bit integer array attribute
step | ::mlir::ArrayAttr | 32-bit integer array attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.softmax (tt::ttir::SoftmaxOp)

Softmax normalization operation.

The softmax operation applies the softmax function along a specified dimension of the input tensor.

The softmax function transforms each element of the input tensor to a value between 0 and 1, such that the sum of all elements along the specified dimension equals 1. This is commonly used to convert a vector of real numbers into a probability distribution.

The softmax function is defined as: softmax(x_i) = exp(x_i) / sum(exp(x_j)) for all j in the specified dimension

Example:

// Softmax along dimension 1
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<2x3xf32>
%result = ttir.softmax(%input, %output) {dimension = 1 : i32} : tensor<2x3xf32>, tensor<2x3xf32> -> tensor<2x3xf32>
// Input tensor:
// [[1.0, 2.0, 3.0],
//  [4.0, 1.0, 2.0]]
// Output tensor (approximate values):
// [[0.09, 0.24, 0.67],  // sum = 1.0
//  [0.84, 0.04, 0.11]]  // sum ≈ 1.0

Note: For numerical stability, the implementation typically subtracts the maximum value in each slice before applying the exponential function.
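
Written out, that numerically stable form is mathematically equivalent, since the subtracted constant cancels in the ratio:

softmax(x_i) = exp(x_i - max(x)) / sum(exp(x_j - max(x))) for all j in the specified dimension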

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • dimension (Integer): The dimension along which to apply the softmax function.

Outputs:

  • result (Tensor): The tensor after applying the softmax function.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
dimension | ::mlir::IntegerAttr | 32-bit signed integer attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.sqrt (tt::ttir::SqrtOp)

Eltwise square root.

The sqrt operation computes the square root of each element in the input tensor.

For each element, it returns the square root of the value.

Example:

// Compute square root of all elements in %input
%result = ttir.sqrt(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[1.3038, 1.4142, NaN, 2.1213], ... ]

Mathematical definition: sqrt(x) = √x

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.squeeze (tt::ttir::SqueezeOp)

Tensor dimension squeezing operation.

The squeeze operation removes a dimension of size 1 from the shape of a tensor.

This operation is commonly used to eliminate unnecessary singleton dimensions from a tensor's shape. It specifies which dimension to remove using the dim attribute. The specified dimension must have size 1.

Example:

// Squeeze dimension 0 from a tensor of shape [1, 3, 4]
%input = ... : tensor<1x3x4xf32>  // Input tensor with shape [1, 3, 4]
%output = ttir.empty() : tensor<3x4xf32>  // Output tensor shape
%result = ttir.squeeze(%input, %output) {
    dim = 0 : i32  // Dimension to squeeze
} : tensor<1x3x4xf32>, tensor<3x4xf32> -> tensor<3x4xf32>
// Result: tensor with shape [3, 4]

// Squeeze dimension 1 from a tensor of shape [2, 1, 3]
%input = ... : tensor<2x1x3xf32>  // Input tensor with shape [2, 1, 3]
%output = ttir.empty() : tensor<2x3xf32>  // Output tensor shape
%result = ttir.squeeze(%input, %output) {
    dim = 1 : i32  // Dimension to squeeze
} : tensor<2x1x3xf32>, tensor<2x3xf32> -> tensor<2x3xf32>
// Result: tensor with shape [2, 3]

Inputs:

  • input (Tensor): The input tensor to squeeze.

Attributes:

  • dim (Integer): The dimension to squeeze.

Outputs:

  • result (Tensor): The squeezed tensor.

Note: The specified dimension must have size 1. The shape of the output tensor is the same as the input tensor with the specified dimension removed. For example, squeezing dimension 1 of a tensor with shape [2, 1, 3] results in a tensor with shape [2, 3].

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute

Operands:

Operand | Description
input | ranked tensor of any type values
output | ranked tensor of any type values

Results:

Result | Description
result | ranked tensor of any type values

ttir.stream_layout (tt::ttir::StreamLayoutOp)

Stream Layout op.

StreamLayout operation used to form a stream between remote and local memory spaces. Note that this op has no side effects; it is purely representational. Its primary use cases are streaming a large tensor out of DRAM via a small L1 buffer, and serving as a means of forming reduce or gather multicast operations. A stream definition includes:

  • The tensor to be streamed.
  • The storage buffer to be used for streaming.
  • A result, which is also able to take a view over the input, i.e. same semantics as the ViewLayout op.

Additional constraints:

  • It is not capable of changing the data type nor the memory space of the tensor.
%input = memref.alloc() {alignment = 64 : i64} : memref<2x4x4x6x!tt.tile<32x32, f32>, #l1_>
%storage = memref.alloc() {alignment = 64 : i64} : memref<2x4x1x1x!tt.tile<32x32, f32>, #l1_>
%stream = "ttir.stream_layout"(%input, %storage) : (memref<2x4x4x6x!tt.tile<32x32, f32>, #l1_>, memref<2x4x1x1x!tt.tile<32x32, f32>, #l1_>) -> memref<2x4x4x6x!tt.tile<32x32, f32>, #tt.view<map(4)>, #l1_>

Traits: AlwaysSpeculatableImplTrait

Interfaces: BufferizableOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, TTIR_ViewOpInterface

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
inputranked tensor of any type values or non-0-ranked.memref of any type values
storageranked tensor of any type values or non-0-ranked.memref of any type values

Results:

ResultDescription
resultranked tensor of any type values or non-0-ranked.memref of any type values

ttir.subtract (tt::ttir::SubtractOp)

Elementwise subtract operation.

The subtract operation performs an elementwise subtraction between two tensors.

For each pair of corresponding elements, it subtracts the element in the second tensor from the element in the first tensor and places the result in the output tensor.

Example:

// Subtraction operation
%result = ttir.subtract(%lhs, %rhs, %output) : tensor<3xi32>, tensor<3xi32>, tensor<3xi32> -> tensor<3xi32>
// Input tensors:
// %lhs: [10, 20, 30]
// %rhs: [1, 2, 3]
// Output tensor:
// [9, 18, 27]

// Example with floating point values
%result = ttir.subtract(%float_lhs, %float_rhs, %float_output) : tensor<3xf32>, tensor<3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensors:
// %float_lhs: [3.5, 0.0, -1.2]
// %float_rhs: [1.5, 2.0, -3.2]
// Output tensor:
// [2.0, -2.0, 2.0]

Note: The data type of the output tensor matches the data type of the input tensors.

Mathematical definition: subtract(x, y) = x - y

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, ThreeOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseBinary, TTIR_PartiallyBroadcastable

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.sum (tt::ttir::SumOp)

Sum reduction operation.

The sum operation computes the sum of elements along specified dimensions of the input tensor.

This operation reduces the input tensor by computing the sum of all elements along the dimensions specified in dim_arg. If dim_arg is not provided, the sum is computed over all dimensions, resulting in a scalar value. If keep_dim is set to true, the reduced dimensions are retained with a size of 1.

Example:

// Sum along dimension 1
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<2xf32>
%result = ttir.sum(%input, %output) {keep_dim = false, dim_arg = [1: i32]} : tensor<2x3xf32>, tensor<2xf32> -> tensor<2xf32>
// Input tensor:
// [[1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0]]
// Output tensor:
// [6.0, 15.0]  // Sum of each row

// Sum along dimension 0
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<3xf32>
%result = ttir.sum(%input, %output) {keep_dim = false, dim_arg = [0: i32]} : tensor<2x3xf32>, tensor<3xf32> -> tensor<3xf32>
// Input tensor:
// [[1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0]]
// Output tensor:
// [5.0, 7.0, 9.0]  // Sum of each column

// Sum over all dimensions
%input = ... : tensor<2x3xf32>
%output = ttir.empty() : tensor<f32>
%result = ttir.sum(%input, %output) {keep_dim = false} : tensor<2x3xf32>, tensor<f32> -> tensor<f32>
// Input tensor:
// [[1.0, 2.0, 3.0],
//  [4.0, 5.0, 6.0]]
// Output tensor:
// 21.0  // Sum of all elements

Mathematical definition: sum(x, dim) = ∑ x[i] for all i in dimension dim

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • keep_dim (Bool): Whether to keep the reduced dimensions or not.
  • dim_arg (Array of Int32): Dimensions to reduce along.

Outputs:

  • output (Tensor): The result tensor after applying the reduction.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
keep_dim | ::mlir::BoolAttr | bool attribute
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.tan (tt::ttir::TanOp)

Elementwise tan operation.

The tan operation computes the tangent of each element in the input tensor.

For each element, it returns the tangent of the angle in radians.

Example:

// Compute tangent of all elements in %input
%result = ttir.tan(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[-7.6966, -2.1850, -0.3093, 4.6373], ... ]

Mathematical definition: tan(x) = sin(x) / cos(x)

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.tanh (tt::ttir::TanhOp)

Elementwise hyperbolic tangent operation.

The tanh operation computes the hyperbolic tangent of each element in the input tensor.

For each element, it returns the hyperbolic tangent of the value.

Example:

// Compute hyperbolic tangent of all elements in %input
%result = ttir.tanh(%input, %output) : tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1.7, 2.0, -0.3, 4.5], ... ]
// Output tensor:
// [[0.9354, 0.9640, -0.2913, 0.9998], ... ]

Mathematical definition: tanh(x) = (e^x - e^-x) / (e^x + e^-x)

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.to_layout (tt::ttir::ToLayoutOp)

Layout op.

Syntax:

operation ::= `ttir.to_layout` $input `,` $output `:` type($input) `into` type($output) (`hostInfo` `=` $layout^)? attr-dict (`->` type($results)^)?

ToLayout operation, transition tensors from one layout to another. Some examples include:

  • Transitioning between different memory spaces, e.g. DRAM to L1.
  • Transitioning between different data types, e.g. f32 to f16.
  • Transitioning between different tile sizes, e.g. 1x16 to 32x32.
  • Transitioning between different tensor shardings.
  • Some combination of the above.
#layout = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.to_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>

Interfaces: BufferizableOpInterface, DestinationStyleOpInterface, MemoryEffectOpInterface, TTIROpInterface

Attributes:

Attribute | MLIR Type | Description
layout | ::mlir::tt::MetalLayoutAttr | Tensor layout attribute

The tensor layout attribute captures how tensor data is sharded across a grid of devices and cores, and how it is laid out in memory.
Some high level goals:
  - **Logical shapes**: Keep the original tensor shape and rank intact and agnostic
    to underlying storage layout.
    Keeping the logical shapes not only makes some graph transformations vastly
    simpler, in particular convs, but it makes the lowered IR much easier to read
    and reason about.  The original tensor shapes leave breadcrumbs that make it
    much easier to map back to the input representation.
  - **Flexible sharding**: Enable flexibility in choosing grid shape, to get better
    parallelization and avoid resharding. This is particularly important in cases
    where tensor shapes are not clean powers of two and would otherwise force our
    hand in choosing non-optimal grid shapes.
  - **Logical-Physical Isomorphism**: Encode this information with just a few
    attributes to enable derived conversions from logical to physical layout and back.
  - **Explicit**: A single source of truth.
  - Enable a direct way to query padded regions.

Please refer to the [Tensor Layout Spec](https://tenstorrent.github.io/tt-mlir/specs/tensor-layout.html) for more in depth documentation.

Examples:
```mlir
tensor<8x300xf32,
  #tt.metal_layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.metal_layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.metal_layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>

tensor<5x3x2x2x7x32x32xf32,
  #tt.metal_layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>
```


Operands:

OperandDescription
inputranked tensor of any type values or non-0-ranked.memref of any type values
outputranked tensor of any type values or non-0-ranked.memref of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.transpose (tt::ttir::TransposeOp)

Tensor transpose operation.

The transpose operation swaps two dimensions of a tensor.

This operation exchanges the positions of two specified dimensions in the input tensor, effectively transposing those dimensions. The shape of the output tensor is the same as the input tensor, except that the dimensions specified by dim0 and dim1 are swapped.

Example:

// Transpose dimensions 0 and 1
%input = ... : tensor<2x3x4xf32>
%output = ttir.empty() : tensor<3x2x4xf32>
%result = ttir.transpose(%input, %output) {dim0 = 0 : i32, dim1 = 1 : i32} :
    tensor<2x3x4xf32>, tensor<3x2x4xf32> -> tensor<3x2x4xf32>
// Input tensor shape: [2, 3, 4]
// Output tensor shape: [3, 2, 4]

// Transpose dimensions 1 and 2
%input = ... : tensor<2x3x4xf32>
%output = ttir.empty() : tensor<2x4x3xf32>
%result = ttir.transpose(%input, %output) {dim0 = 1 : i32, dim1 = 2 : i32} :
    tensor<2x3x4xf32>, tensor<2x4x3xf32> -> tensor<2x4x3xf32>
// Input tensor shape: [2, 3, 4]
// Output tensor shape: [2, 4, 3]

Inputs:

  • input (Tensor): The input tensor.

Attributes:

  • dim0 (Integer): The first dimension to swap.
  • dim1 (Integer): The second dimension to swap.

Outputs:

  • result (Tensor): The transposed tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_TensorManipulation

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
dim0 | ::mlir::IntegerAttr | 32-bit signed integer attribute
dim1 | ::mlir::IntegerAttr | 32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.typecast (tt::ttir::TypecastOp)

Elementwise type casting operation.

The typecast operation converts each element in the input tensor to a different data type.

This operation performs element-wise type conversion, such as converting from integers to floating-point values or between different floating-point precisions. The conversion follows the standard type conversion rules for the target platform.

Example:

// Cast from int32 to float32
%result = ttir.typecast(%input, %output) : tensor<4x4xi32>, tensor<4x4xf32> -> tensor<4x4xf32>
// Input tensor:
// [[1, 2, -3, 4], ... ]
// Output tensor:
// [[1.0, 2.0, -3.0, 4.0], ... ]

// Cast from float32 to int32
%result = ttir.typecast(%float_input, %int_output) : tensor<3xf32>, tensor<3xi32> -> tensor<3xi32>
// Input tensor:
// [1.7, -2.3, 3.0]
// Output tensor:
// [1, -2, 3]  // Note: truncation, not rounding

// Cast from float32 to float64 (higher precision)
%result = ttir.typecast(%f32_input, %f64_output) : tensor<2xf32>, tensor<2xf64> -> tensor<2xf64>
// Input tensor:
// [3.14159, 2.71828]
// Output tensor:
// [3.14159, 2.71828]  // Same values but with higher precision

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable, TwoOperands

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseUnary

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
conservative_folding | ::mlir::BoolAttr | bool attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.unsqueeze (tt::ttir::UnsqueezeOp)

Tensor dimension insertion operation.

The unsqueeze operation inserts a dimension of size 1 into the shape of a tensor.

This operation is the inverse of the squeeze operation and is commonly used to add a singleton dimension to a tensor's shape. It specifies which position to insert the new dimension using the dim attribute.

Example:

// Insert a dimension at position 0 of a tensor with shape [3, 4]
%input = ... : tensor<3x4xf32>  // Input tensor with shape [3, 4]
%output = ttir.empty() : tensor<1x3x4xf32>  // Output tensor shape
%result = ttir.unsqueeze(%input, %output) {
    dim = 0 : i32  // Position to insert the new dimension
} : tensor<3x4xf32>, tensor<1x3x4xf32> -> tensor<1x3x4xf32>
// Result: tensor with shape [1, 3, 4]

// Insert a dimension at position 1 of a tensor with shape [2, 3]
%input = ... : tensor<2x3xf32>  // Input tensor with shape [2, 3]
%output = ttir.empty() : tensor<2x1x3xf32>  // Output tensor shape
%result = ttir.unsqueeze(%input, %output) {
    dim = 1 : i32  // Position to insert the new dimension
} : tensor<2x3xf32>, tensor<2x1x3xf32> -> tensor<2x1x3xf32>
// Result: tensor with shape [2, 1, 3]

Inputs:

  • input (Tensor): The input tensor to unsqueeze.

Attributes:

  • dim (Integer): The position to insert the new dimension.

Outputs:

  • result (Tensor): The unsqueezed tensor.

Note: The shape of the output tensor is the same as the input tensor with a new dimension of size 1 inserted at the specified position. For example, unsqueezing at position 1 of a tensor with shape [2, 3] results in a tensor with shape [2, 1, 3].

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.update_cache (tt::ttir::UpdateCacheOp)

Cache update operation.

The update_cache operation updates a cache tensor with values from an input tensor at specific indices.

This operation is commonly used in sequence models like transformers to update a key-value cache with new token information. It takes a cache tensor, an input tensor, and update indices, and updates the cache at the specified positions.

Example:

// Update cache at specific indices
%cache = ... : tensor<2x16x64xf32>  // Batch size 2, sequence length 16, hidden dim 64
%input = ... : tensor<2x1x64xf32>   // New token embeddings
%update_index = ... : tensor<1xi32> // Update at position [15]
%result = ttir.update_cache(%cache, %input, %update_index) {batch_offset = 0 : i32} :
    tensor<2x16x64xf32>, tensor<2x1x64xf32>, tensor<1xi32> -> tensor<2x16x64xf32>
// The cache tensor is updated at position 15 for both batches with the values from input

Inputs:

  • cache (Tensor): The cache tensor to be updated.
  • input (Tensor): The input tensor containing new values.
  • update_index (Tensor): Indices specifying where to update the cache.

Attributes:

  • batch_offset (Integer): Offset in the batch dimension.

Outputs:

  • result (Tensor): The updated cache tensor.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
batch_offset | ::mlir::IntegerAttr | 32-bit signless integer attribute

Operands:

OperandDescription
cacheranked tensor of any type values
inputranked tensor of any type values
update_indexranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.upsample2d (tt::ttir::Upsample2dOp)

Upsample 2D operation.

The upsample2d operation increases the spatial dimensions (height and width) of an input tensor.

This operation is commonly used in neural networks to increase the spatial resolution of feature maps. It supports different upsampling algorithms such as "nearest" and "bilinear" interpolation. The input tensor is assumed to be in NHWC format (batch, height, width, channels).

Example:

// Upsample a tensor with different scale factors for height and width
%input = ... : tensor<10x64x32x3xbf16>  // Input tensor: [batch=10, height=64, width=32, channels=3]
%output = ttir.empty() : tensor<10x128x128x3xbf16>  // Output tensor shape
%result = ttir.upsample2d(%input, %output) {
    scale_factor = [2, 4],  // Scale height by 2, width by 4
    mode = "bilinear"       // Use bilinear interpolation
} : tensor<10x64x32x3xbf16>, tensor<10x128x128x3xbf16> -> tensor<10x128x128x3xbf16>
// Result: tensor with shape [10,128,128,3]

// Upsample with the same scale factor for both dimensions
%input = ... : tensor<1x32x32x16xf32>  // Input tensor
%output = ttir.empty() : tensor<1x64x64x16xf32>  // Output tensor shape
%result = ttir.upsample2d(%input, %output) {
    scale_factor = 2,     // Scale both height and width by 2
    mode = "nearest"      // Use nearest neighbor interpolation
} : tensor<1x32x32x16xf32>, tensor<1x64x64x16xf32> -> tensor<1x64x64x16xf32>
// Result: tensor with shape [1,64,64,16]

Inputs:

  • input (Tensor): The input tensor to upsample, in NHWC format.

Attributes:

  • scale_factor (Integer or Array of Integer): The scale factor for upsampling in height and width dimensions. If a single integer is provided, it's used for both dimensions. If an array is provided, the first value is used for height and the second for width.
  • mode (String, default="nearest"): The upsampling algorithm to use. Currently supported values are "nearest" for nearest neighbor interpolation and "bilinear" for bilinear interpolation.

Outputs:

  • result (Tensor): The upsampled tensor.

Note: The output height is calculated as input_height * scale_factor[0] and the output width as input_width * scale_factor[1]. The batch and channel dimensions remain unchanged.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
scale_factor | ::mlir::Attribute | 32-bit signed integer attribute or i32 dense array attribute
mode | ::mlir::StringAttr | string attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.view_layout (tt::ttir::ViewLayoutOp)

View Layout op.

ViewLayout operation, used to take a view of one layout into another. Note that this op is purely representational and has no side effects. Its primary use case is to allow reinterpreting the layout of a tensor without actually moving the data. Consumers of this op are expected to compose the view layout with the underlying backing layout.

Additional notes/constraints:

  • It is not capable of changing the data type nor the memory space of the tensor.
  • If reinterpretLayout is true, the layout view change can include a data type cast, but note this does not actually change the format of the data in memory.
  • All ViewLayout ops can trivially be converted to ToLayout ops.
#layout = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.metal_layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.view_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>

Traits: AlwaysSpeculatableImplTrait

Interfaces: BufferizableOpInterface, ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpAsmOpInterface, TTIROpInterface, TTIR_ViewOpInterface

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
reinterpretLayout | ::mlir::BoolAttr | bool attribute

Operands:

OperandDescription
inputranked tensor of any type values or non-0-ranked.memref of any type values

Results:

ResultDescription
resultranked tensor of any type values or non-0-ranked.memref of any type values

ttir.where (tt::ttir::WhereOp)

Elementwise conditional selection operation based on a predicate.

The where operation performs element-wise conditional selection based on a predicate.

For each element position, it selects between two values based on a boolean condition in the first tensor:

  • If the condition is true (non-zero), it selects the corresponding element from the second tensor
  • If the condition is false (zero), it selects the corresponding element from the third tensor

This operation supports broadcasting, allowing inputs of different shapes to be combined according to standard broadcasting rules.

Example:

// Select elements from %true_values where %condition is true,
// otherwise select from %false_values
%result = ttir.where(%condition, %true_values, %false_values, %output) : tensor<4x4xi1>, tensor<4x4xf32>, tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>

// With broadcasting (condition is a scalar)
%result = ttir.where(%scalar_condition, %true_values, %false_values, %output) : tensor<1xi1>, tensor<4x4xf32>, tensor<4x4xf32>, tensor<4x4xf32> -> tensor<4x4xf32>

This operation is equivalent to the ternary conditional operator (condition ? true_value : false_value) in many programming languages, applied elementwise across tensors.

Traits: AlwaysSpeculatableImplTrait, TTIR_Broadcastable

Interfaces: ConditionallySpeculatable, DestinationStyleOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTIROpInterface, TTIR_ElementwiseTernary, TTIR_PartiallyBroadcastable

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
firstranked tensor of any type values
secondranked tensor of any type values
thirdranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.zeros (tt::ttir::ZerosOp)

Creates a tensor filled with zeros.

The zeros operation creates a tensor filled with zeros of the specified shape.

This operation is commonly used to initialize tensors with zero values. It takes a shape attribute and produces a tensor of that shape with all elements set to zero.

Example:

// Create a 3D tensor of zeros with shape [64, 28, 28]
%result = ttir.zeros() {
    shape = [64, 28, 28]
} : () -> tensor<64x28x28xbf16>
// Result: A tensor of shape [64, 28, 28] filled with zeros

// Create a 2D tensor of zeros with shape [3, 4]
%result = ttir.zeros() {
    shape = [3, 4]
} : () -> tensor<3x4xf32>
// Result: [[0.0, 0.0, 0.0, 0.0],
//          [0.0, 0.0, 0.0, 0.0],
//          [0.0, 0.0, 0.0, 0.0]]

Attributes:

  • shape (Array of Integer): The shape of the tensor to create.

Outputs:

  • result (Tensor): The tensor filled with zeros.

Note: The element type of the result tensor is determined by the return type specified in the operation. This operation is useful for initializing tensors before filling them with computed values or as a starting point for accumulation operations.

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
shape | ::mlir::DenseI32ArrayAttr | i32 dense array attribute

Results:

ResultDescription
resultranked tensor of any type values

'ttkernel' Dialect

A TTKernel out-of-tree MLIR dialect.

This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.


ArgAttr

Kernel argument.

Syntax:

#ttkernel.arg<
  ArgType,   # arg_type
  size_t,   # operand_index
  bool   # is_uniform
>

Parameters:

Parameter | C++ type | Description
arg_type | ArgType |
operand_index | size_t |
is_uniform | bool |

ArgSpecAttr

Kernel argument specification.

Syntax:

#ttkernel.arg_spec<
  ::llvm::ArrayRef<ArgAttr>,   # rt_args
  ::llvm::ArrayRef<ArgAttr>   # ct_args
>

A list of argument attributes which together form the argument specification for this kernel.

Parameters:

Parameter | C++ type | Description
rt_args | ::llvm::ArrayRef<ArgAttr> |
ct_args | ::llvm::ArrayRef<ArgAttr> |

ReduceDimAttr

TTKernel Reduce Dimensions

Syntax:

#ttkernel.reduce_dim<
  ::mlir::tt::ttkernel::ReduceDim   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::ttkernel::ReduceDim | an enum of type ReduceDim

ReduceTypeAttr

TTKernel Reduce Types

Syntax:

#ttkernel.reduce_type<
  ::mlir::tt::ttkernel::ReduceType   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::ttkernel::ReduceType | an enum of type ReduceType

ThreadTypeAttr

TTKernel ThreadTypes

Syntax:

#ttkernel.thread<
  ::mlir::tt::ttkernel::ThreadType   # value
>

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::ttkernel::ThreadType | an enum of type ThreadType

ttkernel.add_tiles_init (tt::ttkernel::AddTilesInitOp)

Short init function

Must be run before add_tiles.

Traits: TTKernel_InitOpTrait

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb

ttkernel.add_tiles (tt::ttkernel::AddTilesOp)

Add operation

Performs element-wise addition C=A+B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Traits: TTKernel_BinaryOpTrait, TTKernel_FPUOpTrait

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
in0_tile_indexindex or 32-bit signless integer
in1_tile_indexindex or 32-bit signless integer
dst_indexindex or 32-bit signless integer
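
Putting the pieces together, a minimal single-tile add might look as follows. This is an illustrative sketch in the abbreviated style used by the examples in this document; CB values, index constants, and page counts are assumed, and types are omitted:

// Assumed: %in0_cb, %in1_cb, %out_cb are CBs bound by the kernel; %c0 is index 0; %c1_i32 is the i32 constant 1.
ttkernel.binary_op_init_common(%in0_cb, %in1_cb, %out_cb)
ttkernel.add_tiles_init(%in0_cb, %in1_cb)
ttkernel.cb_wait_front(%in0_cb, %c1_i32)   // wait for one tile in each input CB
ttkernel.cb_wait_front(%in1_cb, %c1_i32)
ttkernel.tile_regs_acquire()               // DST must be acquired before add_tiles
ttkernel.add_tiles(%in0_cb, %in1_cb, %c0, %c0, %c0)   // DST[0] = A[0] + B[0]
ttkernel.tile_regs_commit()
ttkernel.tile_regs_wait()
ttkernel.cb_reserve_back(%out_cb, %c1_i32)
ttkernel.pack_tile(%c0, %out_cb, %c0)
ttkernel.cb_push_back(%out_cb, %c1_i32)
ttkernel.tile_regs_release()
ttkernel.cb_pop_front(%in0_cb, %c1_i32)
ttkernel.cb_pop_front(%in1_cb, %c1_i32)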

ttkernel.binary_op_init_common (tt::ttkernel::BinaryOpInitCommonOp)

Init function for all binary ops

Followed by the specific init required with an opcode (binary_op_specific_init).

Traits: TTKernel_InitOpTrait

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
out_cbTTKernel cb

ttkernel.cb_pop_front (tt::ttkernel::CBPopFrontOp)

CBPopFront call.

CBPopFront operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_push_back (tt::ttkernel::CBPushBackOp)

CBPushBack call.

CBPushBack operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_reinterpret_shape (tt::ttkernel::CBReinterpretShapeOp)

Reinterpret the shape of a given CB

cb_reinterpret_shape operation

Operands:

OperandDescription
inputTTKernel cb

Results:

ResultDescription
outputTTKernel cb

ttkernel.cb_reserve_back (tt::ttkernel::CBReserveBackOp)

CBReserveBack call.

CBReserveBack operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_wait_front (tt::ttkernel::CBWaitFrontOp)

CBWaitFront call.

CBWaitFront operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.reinterpret_cast<volatile tt_l1_ptr uint32_t*> (tt::ttkernel::CastToL1PtrOp)

CastToL1Ptr

Cast specified addr to L1 pointer.

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
addr32-bit signless integer or TTKernel l1 address or TTKernel semaphore

Results:

ResultDescription
l1_ptrTTKernel l1 address pointer

ttkernel.ceil_tile_float32 (tt::ttkernel::CeilTileF32Op)

Ceil f32 tile in the DST at specified index.

Performs element-wise computation of ceil operation DST[dst0_index] <- ceil(DST[dst0_index]) on DST register operands. The DST register buffer must be in acquired state via tile_regs_acquire call.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
dst0_indexindex or 32-bit signless integer

ttkernel.ceil_tile (tt::ttkernel::CeilTileOp)

Ceil tile in the DST at specified index.

Performs element-wise computation of ceil operation DST[dst0_index] <- ceil(DST[dst0_index]) on DST register operands. The DST register buffer must be in acquired state via tile_regs_acquire call.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
dst0_indexindex or 32-bit signless integer

ttkernel.copy_tile_init (tt::ttkernel::CopyTileInitOp)

Perform the init for copy tile. This does not reconfigure the unpacker data types.

Must be called before copy_tile.

Traits: TTKernel_InitOpTrait

Operands:

OperandDescription
cb0TTKernel cb

ttkernel.copy_tile (tt::ttkernel::CopyTileOp)

Copy tile from specified CB to DST.

Copies a single tile from the specified input CB and writes the result to DST at a specified index. The function will employ the unpacker to first unpack into SRC registers and then perform a move into DST registers, at a specified index. For the in_tile_index to be valid for this call, cb_wait_front(n) must previously have been called to ensure that at least some number n>0 of tiles are available in the input CB. CB index 0 then references the first tile in the received section of the CB, up to index n-1 (in FIFO order). The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Operands:

OperandDescription
cb0TTKernel cb
tile_index_cbindex or 32-bit signless integer
tile_index_dstindex or 32-bit signless integer
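
As a small illustration of the constraint that cb_wait_front must precede copy_tile, a sketch in the same abbreviated style as the other examples (CB name and constants assumed):

ttkernel.cb_wait_front(%cb0, %c1_i32)   // at least one tile must be present before indexing the CB
ttkernel.copy_tile_init(%cb0)
ttkernel.tile_regs_acquire()
ttkernel.copy_tile(%cb0, %c0, %c0)      // copy CB tile 0 into DST[0]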

ttkernel.cos_tile_init (tt::ttkernel::CosTileInitOp)

Short init function which configures compute unit for execution of cos_tile.

Must be run before cos_tile.

Traits: TTKernel_InitOpTrait

ttkernel.cos_tile (tt::ttkernel::CosTileOp)

Cos operation

Performs element-wise computation of the trigonometric cosine operation on each element of a tile in DST register at index tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
tile_indexindex or 32-bit signless integer

ttkernel.dprint (tt::ttkernel::DPrintOp)

Print to output stream from kernel.

Syntax:

operation ::= `ttkernel.dprint` `(` $fmt `,` $argv `)` attr-dict `:` `(` type($argv) `)`

std::format style format string:

rewriter.create<ttkernel::DPrintOp>(loc, "nocY={} nocX={} addr={}\\n",
                                  nocY, nocX, addr);
ttkernel.dprint("virtY {} virtX {} addr {}\\n", %14, %15, %13) : (index, index, i32)

Notes:

  • Only trivial format specifier currently supported, i.e. {}.
  • Must double escape newline character or other special characters.

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

Attributes:

Attribute | MLIR Type | Description
fmt | ::mlir::StringAttr | string attribute

Operands:

OperandDescription
argvvariadic of any type

ttkernel.div_binary_tile_init (tt::ttkernel::DivBinaryTilesInitOp)

Short init function

Must be run before div_binary_tile.

Traits: TTKernel_InitOpTrait

ttkernel.div_binary_tile (tt::ttkernel::DivBinaryTilesOp)

Divide operation between two tiles

Performs element-wise computation of division operation DST[dst0_index] <- DST[dst0_index] / DST[dst1_index] on DST register operands. The DST register buffer must be in acquired state via tile_regs_acquire call.

Traits: TTKernel_BinaryOpTrait, TTKernel_SFPUOpTrait

Operands:

OperandDescription
dst0_indexindex or 32-bit signless integer
dst1_indexindex or 32-bit signless integer

ttkernel.exp_tile_init (tt::ttkernel::ExpTileInitOp)

Short init function which configures compute unit for execution of exp_tile.

Must be run before exp_tile.

Traits: TTKernel_InitOpTrait

ttkernel.exp_tile (tt::ttkernel::ExpTileOp)

Exp operation

Performs element-wise computation of exponential on each element of a tile in DST register at index tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
tile_indexindex or 32-bit signless integer
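
A typical unary SFPU sequence built from the ops above, sketched in the abbreviated example style (CB names and constants are assumed): the tile is first copied from the input CB into DST, then exp_tile is applied in place.

ttkernel.init_sfpu(%in_cb, %out_cb)
ttkernel.exp_tile_init()
ttkernel.cb_wait_front(%in_cb, %c1_i32)
ttkernel.tile_regs_acquire()
ttkernel.copy_tile(%in_cb, %c0, %c0)   // bring the input tile into DST[0]
ttkernel.exp_tile(%c0)                 // DST[0] = exp(DST[0])
ttkernel.tile_regs_commit()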

ttkernel.experimental::tilize_block (tt::ttkernel::ExperimentalTilizeBlockOp)

Experimental TilizeBlockOp call.

This is a custom tilize block LLK that takes the dimensions of the block, and properly tilizes each row.

Operands:

OperandDescription
cbInTTKernel cb
cbOutTTKernel cb
blockR32-bit signless integer
blockC32-bit signless integer

ttkernel.experimental::untilize_block (tt::ttkernel::ExperimentalUntilizeBlockOp)

Experimental UntilizeBlockOp call.

This is a custom untilize block LLK that takes the dimensions of the block.

Operands:

OperandDescription
cbInTTKernel cb
cbOutTTKernel cb
blockR32-bit signless integer
blockC32-bit signless integer

ttkernel.fill_tile_init (tt::ttkernel::FillTileInitOp)

Init function for fill_tile operation. Refer to documentation for any init function.

Must be run before fill_tile.

Traits: TTKernel_InitOpTrait

ttkernel.fill_tile (tt::ttkernel::FillTileOp)

Fill tile with specified value.

Fills supplied DST register tile with a supplied f32 value. The DST register must be in acquired state via tile_regs_acquire call.

Example:

ttkernel.fill_tile(%dst_index, %value);

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
dst_indexindex or 32-bit signless integer
value32-bit float

ttkernel.get_arg_val (tt::ttkernel::GetArgValOp)

Get runtime arg value.

Get runtime argument value at specified index.

Operands:

OperandDescription
arg_indexindex or 32-bit signless integer

Results:

ResultDescription
arg_val32-bit signless integer or TTKernel cb or TTKernel l1 address

ttkernel.get_compile_time_arg_val (tt::ttkernel::GetCompileArgValOp)

Get compile-time arg value.

Get compile-time argument value at specified index.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{}

Attributes:

Attribute | MLIR Type | Description
arg_index | ::mlir::IntegerAttr | 32-bit signless integer attribute

Results:

ResultDescription
arg_val32-bit signless integer or TTKernel cb or TTKernel l1 address
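
To illustrate how the two argument mechanisms differ, a sketch in the abbreviated example style (argument indices and result names are assumed): runtime arguments are fetched through a runtime index operand, while compile-time arguments are selected by an attribute.

%dram_base = ttkernel.get_arg_val(%c0)                                   // runtime arg 0
%num_tiles = ttkernel.get_arg_val(%c1)                                   // runtime arg 1
%is_dram   = ttkernel.get_compile_time_arg_val() {arg_index = 0 : i32}   // compile-time arg 0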

ttkernel.get_dataformat (tt::ttkernel::GetDataFormatOp)

Get the data format of a given CB

get_dataformat operation

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
cbTTKernel cb

Results:

ResultDescription
dataFormatTTKernel compute data format type

ttkernel.get_interleaved_addr_gen_fast (tt::ttkernel::GetInterleavedAddrGenFastOp)

GetInterleavedAddrGenFastOp

Returns an InterleavedAddrGenFast type.

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
DRAM1-bit signless integer
bank_base_address32-bit signless integer
page_size32-bit signless integer
data_formatTTKernel compute data format type

Results:

ResultDescription
resultTTKernel InterleavedAddrGenFast type
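
A sketch of how this op is typically combined with get_dataformat, get_tile_size, get_write_ptr, and noc_async_read_tile to stream a tile from interleaved DRAM into a CB, in the abbreviated example style (names and constants assumed):

%fmt    = ttkernel.get_dataformat(%cb)
%pgsize = ttkernel.get_tile_size(%cb)
%agen   = ttkernel.get_interleaved_addr_gen_fast(%is_dram, %bank_base_address, %pgsize, %fmt)
%wptr   = ttkernel.get_write_ptr(%cb)
ttkernel.noc_async_read_tile(%tile_id, %agen, %wptr)   // read tile %tile_id into the CB's write region
ttkernel.noc_async_read_barrier()                      // wait for the read to complete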

ttkernel.get_noc_addr_from_bank_id (tt::ttkernel::GetNocAddrFromBankIDOp)

GetNocAddrFromBankID

GetNocAddrFromBankID api

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
bank_id32-bit signless integer
bankAddressOffset32-bit signless integer

Results:

ResultDescription
nocAddrTTKernel noc address

ttkernel.get_noc_addr (tt::ttkernel::GetNocAddrOp)

GetNocAddr

GetNocAddr api including core coordinates

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
xindex or 32-bit signless integer
yindex or 32-bit signless integer
l1Address32-bit signless integer or TTKernel l1 address or TTKernel semaphore

Results:

ResultDescription
nocAddrTTKernel noc address

ttkernel.get_noc_multicast_addr (tt::ttkernel::GetNocMulticastAddrOp)

GetNocMulticastAddr

GetNocMulticastAddr

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
noc_x_startindex or 32-bit signless integer
noc_y_startindex or 32-bit signless integer
noc_x_endindex or 32-bit signless integer
noc_y_endindex or 32-bit signless integer
addr32-bit signless integer or TTKernel l1 address or TTKernel semaphore
noc8-bit signless integer

Results:

ResultDescription
mcastNocAddrTTKernel noc address

ttkernel.get_read_ptr (tt::ttkernel::GetReadPtrOp)

GetReadPtr

GetReadPtr operation

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
cbTTKernel cb

Results:

ResultDescription
readPtr32-bit signless integer

ttkernel.get_semaphore (tt::ttkernel::GetSemaphoreOp)

GetSemaphoreOp

Get L1 addr of the semaphore with specified semaphore id

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
semaphoreindex or 32-bit signless integer

Results:

ResultDescription
sem_addrTTKernel semaphore

ttkernel.get_tile_size (tt::ttkernel::GetTileSizeOp)

Get the tile size in bytes of a given CB

get_tile_size operation

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
cbTTKernel cb

Results:

ResultDescription
tileSizeBytes32-bit signless integer

ttkernel.get_write_ptr (tt::ttkernel::GetWritePtrOp)

GetWritePtr

GetWritePtr operation

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
cbTTKernel cb

Results:

ResultDescription
writePtr32-bit signless integer

ttkernel.init_sfpu (tt::ttkernel::InitSFPUOp)

Initialization function for SFPU operations.

This operation initializes all necessary components for SFPU operations, including unpacking, packing, and math configurations.

Traits: TTKernel_InitOpTrait

Operands:

OperandDescription
icbTTKernel cb
ocbTTKernel cb

ttkernel.mm_init (tt::ttkernel::MatmulInitOp)

Matmul init function

Can only be run ONCE per kernel. Should be run before matmul.

Traits: TTKernel_InitOpTrait

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
out_cbTTKernel cb
transpose32-bit signless integer

ttkernel.mm_init_short (tt::ttkernel::MatmulInitShortOp)

Matmul short init function

Can be run MULTIPLE times per kernel. Should be run before matmul. Use this if some other init was called between mm_init and matmul_tiles (e.g. in a loop).

Traits: TTKernel_InitOpTrait

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
transpose32-bit signless integer

ttkernel.matmul_tiles (tt::ttkernel::MatmulTilesOp)

Matmul tiles operation

Performs tile-sized matrix multiplication C=A*B between the tiles in two specified input CBs and writes the result to DST. The DST register buffer must be in acquired state via ttkernel.tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Traits: TTKernel_FPUOpTrait, TTKernel_TernaryOpTrait

Operands:

OperandDescription
in0_cb_idTTKernel cb
in1_cb_idTTKernel cb
in0_tile_idxindex or 32-bit signless integer
in1_tile_idxindex or 32-bit signless integer
dst_tile_idxindex or 32-bit signless integer
transpose32-bit signless integer
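
A sketch of the accumulation pattern this op is typically used in, in the abbreviated example style (CB names, constants, and the per-K-tile repetition are assumed); each matmul_tiles call accumulates into the same DST slot:

ttkernel.mm_init(%in0_cb, %in1_cb, %out_cb, %c0_i32)
ttkernel.tile_regs_acquire()
// repeated once per K tile:
ttkernel.cb_wait_front(%in0_cb, %c1_i32)
ttkernel.cb_wait_front(%in1_cb, %c1_i32)
ttkernel.matmul_tiles(%in0_cb, %in1_cb, %c0, %c0, %c0, %c0_i32)   // DST[0] += A_tile * B_tile
ttkernel.cb_pop_front(%in0_cb, %c1_i32)
ttkernel.cb_pop_front(%in1_cb, %c1_i32)
ttkernel.tile_regs_commit()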

ttkernel.max_tile_init (tt::ttkernel::MaxTilesInitOp)

Short init function

Must be run before max_tile.

Traits: TTKernel_InitOpTrait

ttkernel.max_tile (tt::ttkernel::MaxTilesOp)

Max operation

Performs element-wise computation of maximum operation DST[dst0_index] <- max(DST[dst0_index], DST[dst1_index]) on DST register operands. The DST register buffer must be in acquired state via tile_regs_acquire call.

Traits: TTKernel_BinaryOpTrait, TTKernel_SFPUOpTrait

Operands:

OperandDescription
dst0_indexindex or 32-bit signless integer
dst1_indexindex or 32-bit signless integer

ttkernel.mem_zeros_base (tt::ttkernel::MemZerosBaseOp)

Op corresponding to MEM_ZEROS_BASE macro in kernels.

Op corresponding to MEM_ZEROS_BASE macro in kernels.

Interfaces: InferTypeOpInterface

Results:

ResultDescription
result32-bit signless integer

ttkernel.mem_zeros_size (tt::ttkernel::MemZerosSizeOp)

Op corresponding to MEM_ZEROS_SIZE macro in kernels.

Op corresponding to MEM_ZEROS_SIZE macro in kernels.

Interfaces: InferTypeOpInterface

Results:

ResultDescription
result32-bit signless integer

ttkernel.mul_tiles_init (tt::ttkernel::MulTilesInitOp)

Short init function

Must be run before mul_tiles.

Traits: TTKernel_InitOpTrait

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb

ttkernel.mul_tiles (tt::ttkernel::MulTilesOp)

Mul operation

Performs element-wise multiplication C=A*B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Traits: TTKernel_BinaryOpTrait, TTKernel_FPUOpTrait

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
in0_tile_indexindex or 32-bit signless integer
in1_tile_indexindex or 32-bit signless integer
dst_indexindex or 32-bit signless integer

ttkernel.my_x (tt::ttkernel::MyXOp)

MyX

Lowers to the tt-metal supported MY_X macro. This represents the virtual X coordinate of the current core.

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
noc8-bit signless integer

Results:

ResultDescription
xindex

ttkernel.my_y (tt::ttkernel::MyYOp)

MyY

Lowers to the tt-metal supported MY_Y macro. This represents the virtual Y coordinate of the current core.

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
noc8-bit signless integer

Results:

ResultDescription
yindex

ttkernel.negative_tile_init (tt::ttkernel::NegativeTileInitOp)

Short init function which configures compute unit for execution of negative_tile.

Must be run before negative_tile.

Traits: TTKernel_InitOpTrait

ttkernel.negative_tile (tt::ttkernel::NegativeTileOp)

Negative operation

Performs element-wise computation of the negative on each element of a tile in DST register at index tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
tile_indexindex or 32-bit signless integer

ttkernel.noc_async_read_barrier (tt::ttkernel::NocAsyncReadBarrierOp)

NocAsyncReadBarrier

NocAsyncReadBarrier

ttkernel.noc_async_read_one_packet_set_state (tt::ttkernel::NocAsyncReadOnePacketSetStateOp)

NocAsyncReadOnePacketSetState

NocAsyncReadOnePacketSetState

Operands:

OperandDescription
srcNocAddrTTKernel noc address
size32-bit signless integer

ttkernel.noc_async_read_one_packet_with_state (tt::ttkernel::NocAsyncReadOnePacketWithStateOp)

NocAsyncReadOnePacketWithState

NocAsyncReadOnePacketWithState

Operands:

OperandDescription
srcNocAddrTTKernel noc address
dstLocalL1Addr32-bit signless integer or TTKernel l1 address

ttkernel.noc_async_read (tt::ttkernel::NocAsyncReadOp)

NocAsyncRead

NocAsyncRead

Operands:

OperandDescription
srcNocAddrTTKernel noc address
dstLocalL1Addr32-bit signless integer
size32-bit signless integer

ttkernel.noc_async_read_tile (tt::ttkernel::NocAsyncReadTileOp)

NocAsyncReadTile

NocAsyncReadTile

Operands:

OperandDescription
id32-bit signless integer
addrGenStructTTKernel InterleavedAddrGenFast type
dstLocalL1Addr32-bit signless integer

ttkernel.noc_async_write_barrier (tt::ttkernel::NocAsyncWriteBarrierOp)

NocAsyncWriteBarrier

NocAsyncWriteBarrier

ttkernel.noc_async_write_multicast_loopback_src (tt::ttkernel::NocAsyncWriteMulticastLoopbackSrcOp)

NocAsyncWriteMulticastLoopbackSrc

NocAsyncWriteMulticastLoopbackSrc

Attributes:

Attribute | MLIR Type | Description
linked | ::mlir::BoolAttr | bool attribute
multicast_path_reserve | ::mlir::BoolAttr | bool attribute

Operands:

OperandDescription
srcLocalL1Addr32-bit signless integer
dstNocAddrMulticastTTKernel noc address
size32-bit signless integer
num_dests32-bit signless integer
noc8-bit signless integer

ttkernel.noc_async_write_multicast_one_packet (tt::ttkernel::NocAsyncWriteMulticastOnePacketOp)

NocAsyncWriteMulticastOnePacket

NocAsyncWriteMulticastOnePacket issues only a single packet with size <= NOC_MAX_BURST_SIZE (i.e. the maximum packet size).

Attributes:

Attribute | MLIR Type | Description
linked | ::mlir::BoolAttr | bool attribute
multicast_path_reserve | ::mlir::BoolAttr | bool attribute

Operands:

OperandDescription
srcLocalL1Addr32-bit signless integer
dstNocAddrMulticastTTKernel noc address
size32-bit signless integer
num_dests32-bit signless integer
noc8-bit signless integer

ttkernel.noc_async_write_multicast (tt::ttkernel::NocAsyncWriteMulticastOp)

NocAsyncWriteMulticast

Initiates an asynchronous write from a source address in L1 memory on the Tensix core executing this function call to a rectangular destination grid. The destinations are specified using a uint64_t encoding referencing an on-chip grid of nodes located at NOC coordinate range (x_start,y_start,x_end,y_end) and a local address created using get_noc_multicast_addr function. Also, see noc_async_write_barrier.

The destination nodes can only be a set of Tensix cores + L1 memory address. The destination nodes must form a rectangular grid. The destination L1 memory address must be the same on all destination nodes.

With this API, the multicast sender cannot be part of the multicast destinations. If the multicast sender has to be in the multicast destinations (i.e. must perform a local L1 write), the other API variant noc_async_write_multicast_loopback_src can be used.

Note: The number of destinations needs to be non-zero. Besides that, there is no restriction on the number of destinations, i.e. the multicast destinations can span the full chip. However, as mentioned previously, the multicast source cannot be part of the destinations. So, the maximum number of destinations is 119.

Attributes:

Attribute | MLIR Type | Description
linked | ::mlir::BoolAttr | bool attribute
multicast_path_reserve | ::mlir::BoolAttr | bool attribute

Operands:

OperandDescription
srcLocalL1Addr32-bit signless integer
dstNocAddrMulticastTTKernel noc address
size32-bit signless integer
num_dests32-bit signless integer
noc8-bit signless integer
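
A sketch of a multicast write to a rectangular core grid, in the abbreviated example style (coordinates, sizes, and NOC id are assumed):

%mcast_addr = ttkernel.get_noc_multicast_addr(%x_start, %y_start, %x_end, %y_end, %dst_l1_addr, %noc)
ttkernel.noc_async_write_multicast(%src_l1_addr, %mcast_addr, %size, %num_dests, %noc)
    {linked = false, multicast_path_reserve = true}
ttkernel.noc_async_write_barrier()   // wait until the multicast write is flushed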

ttkernel.noc_async_write (tt::ttkernel::NocAsyncWriteOp)

NocAsyncWrite

NocAsyncWrite

Operands:

OperandDescription
srcLocalL1Addr32-bit signless integer
dstNocAddrTTKernel noc address
size32-bit signless integer

ttkernel.noc_async_write_tile (tt::ttkernel::NocAsyncWriteTileOp)

NocAsyncWriteTile

NocAsyncWriteTile

Operands:

OperandDescription
idindex or 32-bit signless integer
addrGenStructTTKernel InterleavedAddrGenFast type
srcLocalL1Addr32-bit signless integer

ttkernel.noc_semaphore_inc (tt::ttkernel::NocSemaphoreIncOp)

NocSemaphoreInc

The Tensix core executing this function call initiates an atomic increment (with 32-bit wrap) of a remote Tensix core L1 memory address. This L1 memory address is used as a semaphore of size 4 Bytes, as a synchronization mechanism.

Operands:

OperandDescription
addrTTKernel noc address
incrindex or 32-bit signless integer
noc_id8-bit signless integer

ttkernel.noc_semaphore_set_multicast_loopback_src (tt::ttkernel::NocSemaphoreSetMulticastLoopbackOp)

NocSemaphoreSetMulticastLoopback

Initiates an asynchronous write from a source address in L1 memory on the Tensix core executing this function call to a rectangular destination grid. The destinations are specified using a uint64_t encoding referencing an on-chip grid of nodes located at NOC coordinate range (x_start,y_start,x_end,y_end) and a local address created using get_noc_multicast_addr function. The size of data that is sent is 4 Bytes. This is usually used to set a semaphore value at the destination nodes, as a way of a synchronization mechanism. The same as noc_async_write_multicast with preset size of 4 Bytes. Note: With this API, sending data only to the source node (when num_dests is 1) may result in unexpected behaviour. For some parameters, hangs have been observed. For some other parameters, nothing may happen. Consider using regular non multicast operations such as noc_async_write in this case.

Attributes:

Attribute | MLIR Type | Description
linked | ::mlir::BoolAttr | bool attribute
multicast_path_reserve | ::mlir::BoolAttr | bool attribute

Operands:

OperandDescription
src_local_l1_addrTTKernel semaphore
dst_noc_addr_multicastTTKernel noc address
num_dests32-bit signless integer

ttkernel.noc_semaphore_set_multicast (tt::ttkernel::NocSemaphoreSetMulticastOp)

NocSemaphoreSetMulticast

Initiates an asynchronous write from a source address in L1 memory on the Tensix core executing this function call to a rectangular destination grid. The destinations are specified using a uint64_t encoding referencing an on-chip grid of nodes located at NOC coordinate range (x_start,y_start,x_end,y_end) and a local address created using get_noc_multicast_addr function. The size of data that is sent is 4 Bytes. This is usually used to set a semaphore value at the destination nodes, as a way of a synchronization mechanism. The same as noc_async_write_multicast with preset size of 4 Bytes. With this API, the multicast sender cannot be part of the multicast destinations. If the multicast sender has to be in the multicast destinations (i.e. must perform a local L1 write), the other API variant noc_semaphore_set_multicast_loopback_src can be used.

Attributes:

Attribute | MLIR Type | Description
linked | ::mlir::BoolAttr | bool attribute
multicast_path_reserve | ::mlir::BoolAttr | bool attribute

Operands:

OperandDescription
src_local_l1_addrTTKernel semaphore
dst_noc_addr_multicastTTKernel noc address
num_dests32-bit signless integer

ttkernel.noc_semaphore_set (tt::ttkernel::NocSemaphoreSetOp)

NocSemaphoreSet

Sets the value of a local L1 memory address on the Tensix core executing this function to a specific value. This L1 memory address is used as a semaphore of size 4 Bytes, as a synchronization mechanism. Also, see noc_semaphore_wait.

Operands:

OperandDescription
sem_addrTTKernel l1 address pointer
valindex or 32-bit signless integer

ttkernel.noc_semaphore_wait_min (tt::ttkernel::NocSemaphoreWaitMinOp)

NocSemaphoreWaitMin

A blocking call that waits until the value of a local L1 memory address on the Tensix core executing this function becomes equal or greater than a target value. This L1 memory address is used as a semaphore of size 4 Bytes, as a synchronization mechanism. Also, see noc_semaphore_set.

Operands:

OperandDescription
sem_addrTTKernel l1 address pointer
val32-bit signless integer

ttkernel.noc_semaphore_wait (tt::ttkernel::NocSemaphoreWaitOp)

NocSemaphoreWait

A blocking call that waits until the value of a local L1 memory address on the Tensix core executing this function becomes equal to a target value. This L1 memory address is used as a semaphore of size 4 Bytes, as a synchronization mechanism. Also, see noc_semaphore_set.

Operands:

OperandDescription
sem_addrTTKernel l1 address pointer
valindex or 32-bit signless integer
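
A sketch of the usual sender/receiver handshake built from the semaphore ops above, in the abbreviated example style (semaphore id, core coordinates, and NOC id are assumed):

// Receiver core:
%sem     = ttkernel.get_semaphore(%sem_id)
%sem_ptr = ttkernel.reinterpret_cast<volatile tt_l1_ptr uint32_t*>(%sem)   // CastToL1PtrOp
ttkernel.noc_semaphore_set(%sem_ptr, %c0)    // reset to 0
ttkernel.noc_semaphore_wait(%sem_ptr, %c1)   // block until the sender bumps the value

// Sender core:
%remote_sem = ttkernel.get_noc_addr(%x, %y, %sem)
ttkernel.noc_semaphore_inc(%remote_sem, %c1, %noc_id)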

ttkernel.pack_tile (tt::ttkernel::PackTileOp)

PackTile op.

Copies a single tile from the DST register buffer at a specified index to a specified CB at a given index. For the out_tile_index to be valid for this call, cb_reserve_back(n) has to be called first to reserve at least some number n > 0 of tiles in the output CB. out_tile_index = 0 then references the first tile in the reserved section of the CB, up to index n - 1, which will then be visible to the consumer in the same order after a cb_push_back call. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Each subsequent pack call will increment the write pointer in the CB by a single tile size. The pointer is then again set to a valid position with space for n reserved tiles by another cb_reserve_back call.

Operates in tandem with functions cb_reserve_back and cb_push_back.

A typical use case: the producer first reserves space for a number of tiles in the buffer via cb_reserve_back, then uses pack_tile to copy a tile from one of the DST slots into a slot in the reserved space, and finally calls cb_push_back to make the reserved section of the circular buffer visible to the consumer.

Attributes:

Attribute | MLIR Type | Description
out_of_order | ::mlir::BoolAttr | bool attribute

Operands:

OperandDescription
dst_indexindex or 32-bit signless integer
out_cbTTKernel cb
out_indexindex or 32-bit signless integer
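
A sketch of the producer sequence described above, in the abbreviated example style (CB name and constants assumed):

ttkernel.cb_reserve_back(%out_cb, %c1_i32)   // reserve space for one output tile
ttkernel.tile_regs_wait()                    // wait for the MATH thread to commit DST
ttkernel.pack_tile(%c0, %out_cb, %c0)        // copy DST[0] into the reserved slot
ttkernel.tile_regs_release()
ttkernel.cb_push_back(%out_cb, %c1_i32)      // make the tile visible to the consumer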

ttkernel.recip_tile_init (tt::ttkernel::RecipTileInitOp)

Init function for recip_tile operation. Refer to documentation for any init function.

Must be called before recip_tile function.

Traits: TTKernel_InitOpTrait

ttkernel.recip_tile (tt::ttkernel::RecipTileOp)

Recip tile in the DST at specified index.

Performs element-wise computation of the reciprocal on each element of a tile in DST register at index tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine. Only works for Float32, Float16_b, Bfp8_b data formats for full accuracy.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
tile_indexindex or 32-bit signless integer

ttkernel.reduce_init (tt::ttkernel::ReduceInitOp)

Init function

Must be run before reduce_tile.

Traits: TTKernel_InitOpTrait

Attributes:

Attribute | MLIR Type | Description
reduce_type | ::mlir::tt::ttkernel::ReduceTypeAttr | TTKernel Reduce Types
reduce_dim | ::mlir::tt::ttkernel::ReduceDimAttr | TTKernel Reduce Dimensions

Operands:

OperandDescription
in_cbTTKernel cb
scaling_cbTTKernel cb
out_cbTTKernel cb

ttkernel.reduce_tile (tt::ttkernel::ReduceTileOp)

Reduce operation

Performs a reduction operation B = reduce(A) using reduce_func for dimension reduction on a tile in the CB at a given index and writes the result to the DST register at index dst_tile_index. Reduction can be of type Reduce::R, Reduce::C, or Reduce::RC, identifying the dimension(s) to be reduced in size to 1. The DST register buffer must be in acquired state via tile_regs_acquire call. The template takes reduce_type, which can be ReduceFunc::Sum or ReduceFunc::Max, and reduce_dim, which can be Reduce::R, Reduce::C, or Reduce::RC. They can also be specified by the defines REDUCE_OP and REDUCE_DIM. This call is blocking and is only available on the compute engine.

Traits: TTKernel_BinaryOpTrait, TTKernel_FPUOpTrait

Attributes:

Attribute | MLIR Type | Description
reduce_type | ::mlir::tt::ttkernel::ReduceTypeAttr | TTKernel Reduce Types
reduce_dim | ::mlir::tt::ttkernel::ReduceDimAttr | TTKernel Reduce Dimensions

Operands:

OperandDescription
in_cbTTKernel cb
scaling_cbTTKernel cb
in_tile_indexindex or 32-bit signless integer
scaling_tile_indexindex or 32-bit signless integer
dst_indexindex or 32-bit signless integer
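
A sketch of a single-tile reduction using the ops above, in the abbreviated example style. The CB names, constants, and in particular the exact spellings of the reduce_type/reduce_dim attribute values are assumed for illustration only:

ttkernel.reduce_init(%in_cb, %scaling_cb, %out_cb)
    {reduce_type = #ttkernel.reduce_type<sum>, reduce_dim = #ttkernel.reduce_dim<row>}
ttkernel.tile_regs_acquire()
ttkernel.reduce_tile(%in_cb, %scaling_cb, %c0, %c0, %c0)
    {reduce_type = #ttkernel.reduce_type<sum>, reduce_dim = #ttkernel.reduce_dim<row>}
ttkernel.tile_regs_commit()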

ttkernel.rounding_op_tile_init (tt::ttkernel::RoundingTileInitOp)

Init function for ceil/floor/round_tile operation. Refer to documentation for any init function.

Must be run before ceil/floor/round_tile.

Traits: TTKernel_InitOpTrait

ttkernel.rsqrt_tile_init (tt::ttkernel::RsqrtTileInitOp)

Short init function which configures compute unit for execution of rsqrt_tile.

Must be run before rsqrt_tile.

Traits: TTKernel_InitOpTrait

ttkernel.rsqrt_tile (tt::ttkernel::RsqrtTileOp)

Rsqrt operation

Performs element-wise computation of reciprocal sqrt on each element of a tile in DST register at index tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
tile_indexindex or 32-bit signless integer

ttkernel.sigmoid_tile_init (tt::ttkernel::SigmoidTileInitOp)

Short init function which configures compute unit for execution of sigmoid_tile.

Must be run before sigmoid_tile.

Traits: TTKernel_InitOpTrait

ttkernel.sigmoid_tile (tt::ttkernel::SigmoidTileOp)

Sigmoid operation

Performs element-wise computation of sigmoid on each element of a tile in DST register at index tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
tile_indexindex or 32-bit signless integer

ttkernel.sin_tile_init (tt::ttkernel::SinTileInitOp)

Init function for sin_tile operation. Refer to documentation for any init function.

Must be run before sin_tile.

Traits: TTKernel_InitOpTrait

ttkernel.sin_tile (tt::ttkernel::SinTileOp)

Sine tile in the DST at specified index.

Performs element-wise computation of sine operation DST[dst0_index] <- sin(DST[dst0_index]) on DST register operands. The DST register buffer must be in acquired state via tile_regs_acquire call.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Operands:

OperandDescription
dst0_indexindex or 32-bit signless integer

ttkernel.store_to_l1 (tt::ttkernel::StoreToL1Op)

StoreToL1

Store value to L1.
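
A minimal sketch of the operand order, using the documented operand types (SSA names are illustrative):

```mlir
// Store the 32-bit value %value into L1 at %l1_ptr plus %offset.
"ttkernel.store_to_l1"(%value, %l1_ptr, %offset) : (i32, !ttkernel.l1_addr_ptr, i32) -> ()
```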

Operands:

OperandDescription
value32-bit signless integer
l1_ptrTTKernel l1 address pointer
offset32-bit signless integer

ttkernel.sub_tiles_init (tt::ttkernel::SubTilesInitOp)

Short init function

Must be run before sub_tiles.

Traits: TTKernel_InitOpTrait

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb

ttkernel.sub_tiles (tt::ttkernel::SubTilesOp)

Sub operation

Performs element-wise subtraction C=A-B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.

Traits: TTKernel_BinaryOpTrait, TTKernel_FPUOpTrait

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
in0_tile_indexindex or 32-bit signless integer
in1_tile_indexindex or 32-bit signless integer
dst_indexindex or 32-bit signless integer

ttkernel.tile_regs_acquire (tt::ttkernel::TileRegsAcquireOp)

Tile_regs_acquire

Acquire an exclusive lock on the DST register for the MATH thread. This register is an array of 16 tiles of 32x32 elements each. This is a blocking function, i.e. this function will wait until the lock is acquired.

ttkernel.tile_regs_commit (tt::ttkernel::TileRegsCommitOp)

Tile_regs_commit

Release lock on DST register by MATH thread. The lock had to be previously acquired with tile_regs_acquire.

ttkernel.tile_regs_release (tt::ttkernel::TileRegsReleaseOp)

Tile_regs_release

Release lock on DST register by PACK thread. The lock had to be previously acquired with tile_regs_wait.

ttkernel.tile_regs_wait (tt::ttkernel::TileRegsWaitOp)

Tile_regs_wait

Acquire an exclusive lock on the DST register for the PACK thread. It waits for the MATH thread to commit the DST register. This is a blocking function, i.e. this function will wait until the lock is acquired.
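
The four ops above form a MATH/PACK handshake over the DST register. A schematic sketch of the ordering follows (comments mark where compute and pack ops would go; this is illustrative, not a real kernel):

```mlir
// MATH thread: lock DST, produce results, then commit them for the packer.
"ttkernel.tile_regs_acquire"() : () -> ()
// ... FPU/SFPU ops that fill DST slots go here ...
"ttkernel.tile_regs_commit"() : () -> ()

// PACK thread: wait for the committed DST, pack it out, then release the lock.
"ttkernel.tile_regs_wait"() : () -> ()
// ... ttkernel.pack_tile calls go here ...
"ttkernel.tile_regs_release"() : () -> ()
```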

ttkernel.tilize_block (tt::ttkernel::TilizeBlockOp)

TilizeBlockOp call.

TilizeBlockOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.tilize_init (tt::ttkernel::TilizeInitOp)

TilizeInitOp call.

Initialize the tilize operation. To be called once at the beginning of a kernel.

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.tilize_init_short (tt::ttkernel::TilizeInitShortOp)

TilizeInitShortOp call.

Re-initialize for the tilize operation. This can be called after a full init.

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.tilize_uninit (tt::ttkernel::TilizeUninitOp)

TilizeUninitOp call.

Uninitialize tilize operation before re-initializing for another operation.

Operands:

OperandDescription
cbInTTKernel cb
cbOutTTKernel cb

ttkernel.typecast_tile_init (tt::ttkernel::TypecastTileInitOp)

Init function for typecast_tile operation. Refer to documentation for any init function.

Must be run before typecast_tile.

Traits: TTKernel_InitOpTrait

ttkernel.typecast_tile (tt::ttkernel::TypecastTileOp)

Cast the dataformat of the tile in the DST at specified index.

Performs element-wise typecast operation DST[dst0_index] <- typecast<in_dataformat, out_dataformat>(DST[dst0_index]) on DST register operands. The DST register buffer must be in acquired state via tile_regs_acquire call.

Traits: TTKernel_SFPUOpTrait, TTKernel_UnaryOpTrait

Attributes:

AttributeMLIR TypeDescription
in_dtype::mlir::tt::DataTypeAttrTT DataTypes
out_dtype::mlir::tt::DataTypeAttrTT DataTypes

Operands:

OperandDescription
dst0_indexindex or 32-bit signless integer

ttkernel.unary_op_init_common (tt::ttkernel::UnaryOpInitCommonOp)

Initialization function for unary operations.

This operation initializes all necessary components for unary operations, including unpacking, packing, and math configurations.

Traits: TTKernel_InitOpTrait

Operands:

OperandDescription
icbTTKernel cb
ocbTTKernel cb

ttkernel.unreachable (tt::ttkernel::UnreachableOp)

Unreachable op.

Unreachable operation

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

ttkernel.untilize_block (tt::ttkernel::UntilizeBlockOp)

UntilizeBlockOp call.

UntilizeBlockOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.untilize_init (tt::ttkernel::UntilizeInitOp)

UntilizeInitOp call.

Init function for untilize operations, to be used at the beginning of the kernel.

Operands:

OperandDescription
cbInTTKernel cb
cbOutTTKernel cb

ttkernel.untilize_init_short (tt::ttkernel::UntilizeInitShortOp)

UntilizeInitShortOp call.

Re-initialize for the untilize operation. This can be called after a full init.

Operands:

OperandDescription
cbInTTKernel cb

ttkernel.untilize_uninit (tt::ttkernel::UntilizeUninitOp)

UntilizeUninitOp call.

Uninitialize untilize operation, to allow initializing another operation.

Operands:

OperandDescription
cbInTTKernel cb

CBType

TTKernel cb

Syntax:

!ttkernel.cb<
  CBPort,   # port
  uint64_t,   # address
  MemRefType,   # memref
  uint64_t,   # page_size
  uint64_t   # num_buffers
>

Circular buffer type in TTKernel dialect

Parameters:

ParameterC++ typeDescription
portCBPort
addressuint64_t
memrefMemRefType
page_sizeuint64_t
num_buffersuint64_t
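
A hypothetical instance of the type, with made-up parameter values (the port name, address, memref, page size, and buffer count are placeholders chosen only to match the syntax above):

```mlir
!ttkernel.cb<cb_in0, 102400, memref<64x128xbf16>, 4096, 2>
```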

DataFormatType

TTKernel compute data format type

Syntax: !ttkernel.DataFormat

Data format type in TTKernel dialect

InterleavedAddrGenFastType

TTKernel InterleavedAddrGenFast type

Syntax: !ttkernel.interleaved_addr_gen_fast

InterleavedAddrGenFast type in TTKernel dialect

L1AddrType

TTKernel l1 address

Syntax: !ttkernel.l1_addr

L1 address type in TTKernel dialect

L1AddrPtrType

TTKernel l1 address pointer

Syntax: !ttkernel.l1_addr_ptr

L1 pointer address type in TTKernel dialect

NocAddrType

TTKernel noc address

Syntax: !ttkernel.noc_addr

Noc address type in TTKernel dialect

SemaphoreType

TTKernel semaphore

Syntax: !ttkernel.semaphore

Semaphore type in TTKernel dialect

'ttmetal' Dialect

A TTMetal out-of-tree MLIR dialect.

This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

ttmetal.create_buffer (tt::ttmetal::CreateBufferOp)

Create buffer op.

Create buffer operation

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Allocate on ::mlir::SideEffects::DefaultResource}

Attributes:

AttributeMLIR TypeDescription
address::mlir::IntegerAttr64-bit signless integer attribute

Results:

ResultDescription
resultnon-0-ranked.memref of any type values

ttmetal.deallocate_buffer (tt::ttmetal::DeallocateBufferOp)

Deallocate buffer op.

Deallocate buffer operation

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Free on ::mlir::SideEffects::DefaultResource}

Operands:

OperandDescription
inputnon-0-ranked.memref of any type values

ttmetal.enqueue_program (tt::ttmetal::EnqueueProgramOp)

Enqueue program op.

Enqueue program operation

Traits: AttrSizedOperandSegments

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

Attributes:

AttributeMLIR TypeDescription
cb_ports::mlir::DenseI64ArrayAttri64 dense array attribute
kernelConfigs::mlir::ArrayAttr

Operands:

OperandDescription
buffersvariadic of non-0-ranked.memref of any type values
cbsvariadic of non-0-ranked.memref of any type values

ttmetal.enqueue_read_buffer (tt::ttmetal::EnqueueReadBufferOp)

Enqueue read buffer op.

Enqueue read buffer operation

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

Operands:

OperandDescription
inputnon-0-ranked.memref of any type values
outputnon-0-ranked.memref of any type values

ttmetal.enqueue_write_buffer (tt::ttmetal::EnqueueWriteBufferOp)

Enqueue write buffer op.

Enqueue write buffer operation

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

Operands:

OperandDescription
inputnon-0-ranked.memref of any type values
outputnon-0-ranked.memref of any type values

ttmetal.finish (tt::ttmetal::FinishOp)

Finish op for command queue.

Global barrier op, used to wait for all commands on queue to finish.

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface)

Effects: MemoryEffects::Effect{MemoryEffects::Read on ::mlir::SideEffects::DefaultResource, MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

'ttnn' Dialect

A TTNN out-of-tree MLIR dialect.

This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

ttnn.abs (tt::ttnn::AbsOp)

Eltwise absolute.

Eltwise absolute operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.add (tt::ttnn::AddOp)

Eltwise add.

Eltwise add operation.
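
A minimal usage sketch in the style of the other eltwise examples in this dialect (the values are illustrative):

```mlir
// %lhs: [1.0, 2.0, 3.0]
// %rhs: [10.0, 20.0, 30.0]
%result = "ttnn.add"(%lhs, %rhs) : (tensor<3xf32>, tensor<3xf32>) -> tensor<3xf32>
// %result: [11.0, 22.0, 33.0]
```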

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.all_gather (tt::ttnn::AllGatherOp)

All gather op.

Tensor All Gather operation

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
all_gather_dim::mlir::IntegerAttr32-bit signed integer attribute
cluster_axis::mlir::IntegerAttr32-bit unsigned integer attribute
num_links::mlir::IntegerAttr32-bit unsigned integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.all_reduce (tt::ttnn::AllReduceOp)

All reduce op.

Tensor All Reduce operation

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
reduce_type::mlir::tt::ReduceTypeAttrTT Reduce Type
cluster_axis::mlir::IntegerAttr32-bit unsigned integer attribute
num_links::mlir::IntegerAttr32-bit unsigned integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.alloc (tt::ttnn::AllocOp)

Alloc op.

Tensor Alloc operation

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
address::mlir::IntegerAttr64-bit signless integer attribute
size::mlir::IntegerAttr64-bit signless integer attribute
buffer_type::mlir::tt::ttnn::BufferTypeAttrTTNN Buffer Type

Results:

ResultDescription
resultranked tensor of any type values

ttnn.arange (tt::ttnn::ArangeOp)

Arange operation.

Tensor arange operation.

Produces a (1, 1, 1, N)-shaped tensor with values from start to end (exclusive) with a step size of step.

Examples: %0 = "ttnn.arange"() {start = 0 : i64, end = 5 : i64 step = 1 : i64} : () -> tensor<1x1x1x5xi64> // %0: [[[[0, 1, 2, 3, 4]]]]

%1 = "ttnn.arange"() {start = 0 : i64, end = 10 : i64, step = 2 : i64} : () -> tensor<1x1x1x5xf32> // %1: [[[[0.0, 2.0, 4.0, 6.0, 8.0]]]]

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
start::mlir::IntegerAttr64-bit signless integer attribute
end::mlir::IntegerAttr64-bit signless integer attribute
step::mlir::IntegerAttr64-bit signless integer attribute
dtype::mlir::tt::DataTypeAttrTT DataTypes
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.argmax (tt::ttnn::ArgMaxOp)

Argmax reduction op.

Determine the indices of the maximum values along a specified dimension of a tensor or over all elements in a tensor.

Parameters:

  • input: The input tensor.
  • dim: Specifies the dimension along which the argmax is applied.
  • keep_dim: If set to true, the output tensor will have the same number of dimensions as the input tensor.
  • use_multicore: Whether to use multiple cores or not.

IR usage:

  // Input tensor of shape (128, 28, 28, 64)
  %input = ... : tensor<128x28x28x64xbf16>
  %empty = "ttnn.empty"(%0) <{dtype = #tt.supportedDataTypes, ....}> : -> tensor<128x28x28xi32>
  %4 = "ttnn.argmax"(%input, %empty) <{dim = 3 : i32, use_multicore = false}> : (tensor<128x28x28x64xbf16>, tensor<128x28x28xi32>) -> tensor<128x28x28xi32>

Example: input: [[1, 5, 3], [2, 4, 6]]

// Computing along dim 0 output: [1, 0, 1]

// Computing along dim 1 output: [1, 2]

// Computing for entire tensor output: 5

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signless integer attribute
keep_dim::mlir::BoolAttrbool attribute
use_multicore::mlir::BoolAttrbool attribute
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.atan2 (tt::ttnn::Atan2Op)

Eltwise atan2 OP.

Performs element-wise atan2 operation on lhs and rhs tensor and produces a result tensor.

Example:

  // %lhs: [0.0, 1.0, -1.0]
  // %rhs: [1.0, 0.0, 0.0]
  %result = "ttnn.atan2"(%lhs, %rhs) : (tensor<3xf64>, tensor<3xf64>) -> tensor<3xf64>
  // %result: [0.0, 1.57079637, -1.57079637] // [0.0, pi/2, -pi/2]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.atan (tt::ttnn::AtanOp)

Eltwise arctangent op.

Performs an elementwise arctangent (atan) operation on the input tensor. This operation computes the inverse tangent of each element, returning values in the range [-π/2, π/2]. Supports floating-point tensor types.

Example:

%input = tensor<4xf32> {1.0, 0.5, 0.0, -1.0}
%result = "ttir.atan"(%input) : (tensor<4xf32>) -> tensor<4xf32>

Given the input [1.0, 0.5, 0.0, -1.0], the result would be approximately: [0.785, 0.464, 0.0, -0.785] (values in radians).

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.avg_pool2d (tt::ttnn::AvgPool2dOp)

Applies a 2D average pooling over an input signal composed of several input planes.

It is a downsampling operation that reduces the spatial dimensions (height and width) of an input tensor by computing averages within a window.

Example:

  // 3x3 input tensor
  input: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
  kernel_height: 2, kernel_width: 2
  stride_height: 1, stride_width: 1
  dilation_height: 1, dilation_width: 1
  output: [[3, 4], [6, 7]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
batch_size::mlir::IntegerAttr32-bit signed integer attribute
input_height::mlir::IntegerAttr32-bit signed integer attribute
input_width::mlir::IntegerAttr32-bit signed integer attribute
channels::mlir::IntegerAttr32-bit signed integer attribute
kernel_size::mlir::DenseI32ArrayAttri32 dense array attribute
stride::mlir::DenseI32ArrayAttri32 dense array attribute
padding::mlir::DenseI32ArrayAttri32 dense array attribute
dilation::mlir::DenseI32ArrayAttri32 dense array attribute
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

applied_shard_scheme::mlir::tt::ttnn::TensorMemoryLayoutAttrTTNN Tensor Memory Layout
ceil_mode::mlir::BoolAttrbool attribute
in_place_halo::mlir::BoolAttrbool attribute

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.batch_norm (tt::ttnn::BatchNormOp)

Batch normalization op.

Batch normalization operation over each channel on input tensor.

Traits: AlwaysSpeculatableImplTrait, AttrSizedOperandSegments

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
training::mlir::BoolAttrbool attribute
epsilon::mlir::FloatAttr32-bit float attribute
momentum::mlir::FloatAttr32-bit float attribute
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values
running_meanranked tensor of any type values
running_varranked tensor of any type values
weightranked tensor of any type values
biasranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.bitwise_and (tt::ttnn::BitwiseAndOp)

Eltwise bitwise AND.

Performs element-wise bitwise AND of two tensors lhs and rhs and produces a result tensor.

Example:

  // %lhs: [[1, 2], [3, 4]]
  // %rhs: [[5, 6], [7, 8]]
  %result = "ttnn.bitwise_and"(%lhs, %rhs) : (tensor<2x2xi32>, tensor<2x2xi32>) -> tensor<2x2xi32>
  // %result: [[1, 2], [3, 0]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.bitwise_not (tt::ttnn::BitwiseNotOp)

Eltwise bitwise NOT.

Performs element-wise NOT of tensor operand and produces a result tensor.

Example:

  // Bitwise operation with integer tensors
  // %operand: [[1, 2], [3, 4]]
  %result = "ttnn.bitwise_not"(%operand) : (tensor<2x2xi32>) -> tensor<2x2xi32>
  // %result: [[-2, -3], [-4, -5]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.bitwise_or (tt::ttnn::BitwiseOrOp)

Eltwise bitwise OR.

Performs element-wise bitwise OR of two tensors lhs and rhs and produces a result tensor.

Example:

  // %lhs: [[1, 2], [3, 4]]
  // %rhs: [[5, 6], [7, 8]]
  %result = "ttnn.bitwise_or"(%lhs, %rhs) : (tensor<2x2xi32>, tensor<2x2xi32>) -> tensor<2x2xi32>
  // %result: [[5, 6], [7, 12]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.bitwise_xor (tt::ttnn::BitwiseXorOp)

Eltwise bitwise XOR.

Performs element-wise bitwise XOR of two tensors lhs and rhs and produces a result tensor.

Example:

  // %lhs: [[1, 2], [3, 4]]
  // %rhs: [[5, 6], [7, 8]]
  %result = "ttnn.bitwise_xor"(%lhs, %rhs) : (tensor<2x2xi32>, tensor<2x2xi32>) -> tensor<2x2xi32>
  // %result: [[4, 4], [4, 12]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.cbrt (tt::ttnn::CbrtOp)

Eltwise cubic root.

Eltwise cubic root operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.ceil (tt::ttnn::CeilOp)

Eltwise ceil.

Eltwise ceil operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.clamp_scalar (tt::ttnn::ClampScalarOp)

Clamp op.

Clamp tensor values to a specified range.

Example:

  // min: 2.000000e+00
  // max: 5.000000e+00
  // input: [[0, 1, 2, 3, 4, 5, 6, 7]]
  "ttnn.clamp_scalar"(%arg0) <{max = 5.000000e+00 : f32, min = 2.000000e+00 : f32}>
  // %out: [[2, 2, 2, 3, 4, 5, 5, 5]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
min::mlir::FloatAttr32-bit float attribute
max::mlir::FloatAttr32-bit float attribute
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.clamp_tensor (tt::ttnn::ClampTensorOp)

Clamp op.

Clamp tensor values to a specified range using min/max as tensor.

Example:

  // input: [[0, 1, 2, 3, 4, 5, 6, 7]]
  // min:   [[2, 2, 2, 3, 3, 3, 0, 0]]
  // max:   [[5, 5, 5, 9, 9, 9, 6, 6]]
  "ttnn.clamp_tensor"(%input, %min, %max)
  // %out: [[2, 2, 2, 3, 4, 5, 6, 6]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values
minranked tensor of any type values
maxranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.collective_permute (tt::ttnn::CollectivePermuteOp)

Collective permute op.

Collective permute op. This operation ingests a multi-device tensor spread across multiple devices and shuffles its data according to source_target_pairs, a list of [source, destination] device pairs.

Example: For a 1x2 mesh, the following will take the device shard living in device 0 and move it to device 1. The device shard living in device 1 will move to device 0. %source_target_pairs: [[0, 1], [1, 0]]

In the case of missing 'dest', the device shard living on that device will contain values of 0. For example, device shard living in device 0 will contain 0 values. %source_target_pairs: [[0, 1]]

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
source_target_pairs::mlir::DenseIntElementsAttr64-bit signless integer elements attribute

Operands:

OperandDescription
inputranked tensor of any type values
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.concat (tt::ttnn::ConcatOp)

Concat op.

Concat tensors along a given dimension.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.constant (tt::ttnn::ConstantOp)

Constant op.

Produces a tensor filled with the given constant value.

Examples: %0 = "ttnn.constant"() {value = dense<[[3, 4, 2], [1, 7, 8]]> : tensor<2x3xui16>} : () -> tensor<2x3xui16> // %0: [[3, 4, 2], [1, 7, 8]] %1 = "ttnn.constant"() {value = dense<[0.2, 1.3]> : tensor<2xf32>} : () -> tensor<2xf32> // %1: [0.2, 1.3]

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
value::mlir::ElementsAttrconstant vector/tensor attribute

Results:

ResultDescription
resultranked tensor of any type values

ttnn.conv2d (tt::ttnn::Conv2dOp)

Conv2d operation.

Applies a 2D convolution over an input image composed of several input planes.

Inputs:

  • input (AnyRankedTensor): expected in the following flattened format (1, 1, N * H_in * W_in, C) where:
    • N is the batch size
    • H_in is the height of the input planes
    • W_in is the width of the input planes
    • C is the number of channels
  • weight (AnyRankedTensor): expected in the following format (O, C/G, K_H, K_W).
  • bias (Optional): expected in the following format (1, 1, 1, O) where:
    • C is the number of input channels
    • O is the number of output channels
    • G is the number of groups
    • K_H is the height of the kernel
    • K_W is the width of the kernel

Attributes:

  • in_channels (i32): The number of input channels.
  • out_channels (i32): The number of output channels.
  • batch_size (i32): The batch size.
  • input_height (i32): The input height.
  • input_width (i32): The input width.
  • kernel_size (array<2xi32>): [K_H, K_W] where K_H is the kernel height and K_W is the kernel width.
  • stride (array<2xi32>): [sH, sW] where sH is stride for height and sW is stride for width.
  • padding (array<2xi32> | array<4xi32>):
    • array<2xi32>: [pH, pW] where pH is padding for height (top/bottom) and pW is padding for width (left/right).
    • array<4xi32>: [pT, pB, pL, pR] for top, bottom, left, and right padding respectively.
  • dilation (array<2xi32>): [dH, dW] where dH is dilation for height and dW is dilation for width.
  • groups (i32): Number of blocked connections from input channels to output channels. Input and output channels must both be divisible by groups.

Outputs:

  • result (AnyRankedTensor): returned in the following flattened format (1, 1, N * H_out * W_out, O) where:
    • H_out = (H_in + pT + pB - dH * (K_H - 1) - 1) / sH + 1
    • W_out = (W_in + pL + pR - dW * (K_W - 1) - 1) / sW + 1

Example:

  %input = ttir.empty() : () -> tensor<1x1x1024x64xbf16>
  %weight = ttir.empty() : () -> tensor<64x64x3x3xbf16>
  %bias = ttir.empty() : () -> tensor<1x1x1x64xbf16>
  %device = "ttnn.get_device"() <{mesh_shape = #ttnn<mesh_shape 1x1>}> : () -> !ttnn.device
  %0 = "ttnn.conv2d"(%input, %weight, %bias, %device)
        <{
          in_channels = 64: i32,
          out_channels = 64: i32,
          batch_size = 1: i32,
          input_height = 32: i32,
          input_width = 32: i32,
          kernel_size = array<i32: 3, 3>,
          stride = array<i32: 1, 1>,
          padding = array<i32: 0, 0>,
          dilation = array<i32: 1, 1>,
          groups = 1: i32
        }> : (tensor<1x1x1024x64xbf16>, tensor<64x64x3x3xbf16>, tensor<1x1x1x64xbf16>, !ttnn.device) -> tensor<1x1x900x64xbf16>
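
Working the example through the output formulas above: H_out = (32 + 0 + 0 - 1 * (3 - 1) - 1) / 1 + 1 = 30, and likewise W_out = 30, so the flattened result has N * H_out * W_out = 1 * 30 * 30 = 900 rows, matching tensor<1x1x900x64xbf16>.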

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
in_channels::mlir::IntegerAttr32-bit signless integer attribute
out_channels::mlir::IntegerAttr32-bit signless integer attribute
batch_size::mlir::IntegerAttr32-bit signless integer attribute
input_height::mlir::IntegerAttr32-bit signless integer attribute
input_width::mlir::IntegerAttr32-bit signless integer attribute
kernel_size::mlir::DenseI32ArrayAttri32 dense array attribute
stride::mlir::DenseI32ArrayAttri32 dense array attribute
padding::mlir::DenseI32ArrayAttri32 dense array attribute
dilation::mlir::DenseI32ArrayAttri32 dense array attribute
groups::mlir::IntegerAttr32-bit signless integer attribute
conv2d_config::mlir::tt::ttnn::Conv2dConfigAttr
TTNN Conv2dConfig attribute{{% markdown %}} TTNN conv2d config attribute {{% /markdown %}}
compute_config::mlir::tt::ttnn::DeviceComputeKernelConfigAttr
TTNN DeviceComputeKernelConfig attribute{{% markdown %}} The TTNN_DeviceComputeKernelConfig attribute configures compute kernel execution parameters for tensor operations on Tenstorrent devices. This attribute provides fine-grained control over mathematical precision, memory usage, and synchronization behavior during compute operations.
Parameters:
  - `math_fidelity`: Controls the mathematical precision and accuracy of compute operations. This parameter affects the trade-off between computational speed and numerical precision. Higher fidelity modes provide more accurate results but may require additional computational cycles.
  - `math_approx_mode`: Configures SFPU operation mode:
    - Precise mode (false): Higher accuracy with more computational cycles and better PCC
    - Approximate mode (true): Faster execution with fewer cycles but reduced accuracy
  - `fp32_dest_acc_en`: Configures destination registers to use 32-bit floating-point precision instead of the default 16-bit mode. It provides higher precision at the cost of reducing available destination register count by half.
  - `packer_l1_acc`: When packing multiple tiles to the same address, subsequent packs perform accumulation (addition using FP16 or FP32 precision) rather than overwriting.
  - `dst_full_sync_en`: Configures destination register acquisition mode:
    - Half mode (false): Acquires 8 tiles in destination registers
    - Full mode (true): Acquires 16 tiles in destination registers, providing increased parallelism at the cost of higher resource usage

Example:

```mlir
#device_compute_kernel_config = #ttnn.device_compute_kernel_config<
  math_fidelity = lofi,
  math_approx_mode = true,
  fp32_dest_acc_en = false,
  packer_l1_acc = false,
  dst_full_sync_en = false
>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
biasranked tensor of any type values
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.conv_transpose2d (tt::ttnn::ConvTranspose2dOp)

ConvTranspose2d operation.

Applies a 2D transposed convolution operator over an input image composed of several input planes.

Inputs:

  • input AnyRankedTensor: expected in the following format (N, H_in, W_in, C) where:

    • N is the batch size
    • H_in is the height of the input planes
    • W_in is the width of the input planes
    • C is the number of channels
  • weight AnyRankedTensor: expected in the following format (C, O/G, K_H, K_W).

  • bias Optional: expected in the following format (1, 1, 1, O) where:

    • C is the number of input channels
    • O is the number of output channels
    • G is the number of groups
    • K_H is the height of the kernel
    • K_W is the width of the kernel
  • output AnyRankedTensor: expected in the following format (N, H_out, W_out, O) where:

    • H_out = (H_in - 1) * stride[0] - 2 * padding[0] + dilation[0] * (K_H - 1) + output_padding[0] + 1
    • W_out = (W_in - 1) * stride[1] - 2 * padding[1] + dilation[1] * (K_W - 1) + output_padding[1] + 1

Attributes:

  • in_channels i32: The number of input channels.
  • out_channels i32: The number of output channels.
  • batch_size i32: The batch size.
  • input_height i32: The input height.
  • input_width i32: The input width.
  • kernel_size array<2xi32>: The kernel size.
  • stride array<2xi32>: Controls the stride for the cross-correlation.
  • padding array<2xi32>: Controls the amount of implicit zero padding on both sides for dilation * (kernel_size - 1) - padding number of points.
  • output_padding array<2xi32>: Controls the additional size added to one side of the output shape.
  • dilation array<2xi32>: Controls the spacing between the kernel points
  • groups i32: Controls the connections between inputs and outputs. Input and output channels must both be divisible by groups.

Example:

  // %input: tensor<3x8x8x256xbf16>
  // %weight: tensor<256x256x3x3xbf16>
  // %bias: tensor<1x1x1x256xbf16>
  // %output: tensor<3x10x10x256xbf16>
  %0 = "ttnn.conv_transpose2d"(%input, %weight, %bias, %output, %device)
        <{
          batch_size = 3: i32,
          dilation = array<i32: 1, 1>,
          groups = 1: i32,
          in_channels = 256: i32,
          input_height = 8: i32,
          input_width = 8: i32,
          kernel_size = array<i32: 3, 3>,
          out_channels = 256: i32,
          output_padding = array<i32: 0, 0>,
          padding = array<i32: 0, 0>,
          stride = array<i32: 1, 1>
        }> : (tensor<3x8x8x256xbf16>, tensor<256x256x3x3xbf16>, tensor<1x1x1x256xbf16>, tensor<3x10x10x256xbf16>) -> tensor<3x10x10x256xbf16>
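
Checking the example against the output formulas above: H_out = (8 - 1) * 1 - 2 * 0 + 1 * (3 - 1) + 0 + 1 = 10, and likewise W_out = 10, matching tensor<3x10x10x256xbf16>.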

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
in_channels::mlir::IntegerAttr32-bit signless integer attribute
out_channels::mlir::IntegerAttr32-bit signless integer attribute
batch_size::mlir::IntegerAttr32-bit signless integer attribute
input_height::mlir::IntegerAttr32-bit signless integer attribute
input_width::mlir::IntegerAttr32-bit signless integer attribute
kernel_size::mlir::DenseI32ArrayAttri32 dense array attribute
stride::mlir::DenseI32ArrayAttri32 dense array attribute
padding::mlir::DenseI32ArrayAttri32 dense array attribute
output_padding::mlir::DenseI32ArrayAttri32 dense array attribute
dilation::mlir::DenseI32ArrayAttri32 dense array attribute
groups::mlir::IntegerAttr32-bit signless integer attribute
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
biasranked tensor of any type values
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.cos (tt::ttnn::CosOp)

Eltwise cosine.

Eltwise cosine operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.deallocate (tt::ttnn::DeallocateOp)

Deallocate op.

Tensor Deallocate operation

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
force::mlir::BoolAttrbool attribute

Operands:

OperandDescription
inputranked tensor of any type values

ttnn.dequantize (tt::ttnn::DequantizeOp)

Dequantize operation.

Applies dequantization to the input tensor.

Inputs:

  • input AnyRankedTensor: The input tensor to be dequantized. Must have quantized element type.
  • scale AnyRankedTensor: The scale factor (or factors for per-axis quantization).
  • zero_point AnyRankedTensor: The zero point value (or values for per-axis quantization). Must be in range of the quantized storage type.
  • axis Optional: The axis along which quantization is applied. Must be in range [0, rank) where rank is the rank of the input tensor.
  • output_dtype Optional<TT_DataTypeAttr>: The data type of the output tensor.
  • memory_config Optional<TTNN_MemoryConfigAttr>: The memory configuration for the output tensor.
// For per-tensor dequantization:
output[i] = (input[i] - zero_point) * scale
// For per-axis dequantization:
output[i0, i1, ..., ia, ..., in] = (input[i0, i1, ..., ia, ..., in] - zero_point[ia]) * scale[ia]

Example:

%input = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>
%output = ttir.empty() : () -> tensor<64x128xf32>
%dequantized = "ttnn.dequantize"(%input, %output) : (tensor<64x128x!quant.uniform<i32:f32, 0.1>, tensor<64x128xf32>) -> tensor<64x128xf32>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
axis::mlir::IntegerAttr32-bit signless integer attribute
output_dtype::mlir::tt::DataTypeAttrTT DataTypes
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `scale` | ranked tensor of any type values |
| `zero_point` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.divide (tt::ttnn::DivideOp)

Eltwise divide.

Eltwise divide operation.
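
For illustration, a hedged sketch of the two-operand eltwise form shared by this op and the other binary eltwise ops in this dialect (shapes and element types are illustrative):

```mlir
// Element-wise division of two tensors of matching shape.
%quotient = "ttnn.divide"(%lhs, %rhs) : (tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
```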

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.embedding_bw (tt::ttnn::EmbeddingBackwardOp)

Embedding backward op.

Embedding backward operation. Generates the gradient of the embedding operation with respect to the input.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `dtype` | ::mlir::tt::DataTypeAttr | TT DataTypes |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

| Operand | Description |
| ------- | ----------- |
| `input` | ranked tensor of any type values |
| `weight` | ranked tensor of any type values |
| `in_gradient` | ranked tensor of any type values |

Results:

| Result | Description |
| ------ | ----------- |
| `result` | ranked tensor of any type values |

ttnn.embedding (tt::ttnn::EmbeddingOp)

Embedding op.

Embedding operation.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.empty (tt::ttnn::EmptyOp)

Empty op.

Tensor empty operation
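
A hedged sketch combining the attributes listed below; the attribute literals and the device type spelling are illustrative assumptions rather than exact TTNN syntax:

```mlir
// Allocate an uninitialized 64x128 bf16 tensor on the device,
// interleaved in DRAM.
%0 = "ttnn.empty"(%device) <{
  shape = #ttnn.shape<64x128>,
  dtype = #tt.supportedDataTypes<bf16>,
  layout = #ttnn.layout<tile>,
  memory_config = #ttnn.memory_config<#dram, <interleaved>>
}> : (!ttnn.device) -> tensor<64x128xbf16>
```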

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `shape` | ::mlir::tt::ttnn::ShapeAttr | TTNN Shape attribute |
| `dtype` | ::mlir::tt::DataTypeAttr | TT DataTypes |
| `layout` | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.eq (tt::ttnn::EqualOp)

Eltwise equal to.

Eltwise equal to operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.erf (tt::ttnn::ErfOp)

Eltwise erf op.

Eltwise erf operation. Calculates erf(x) for each element of the input tensor.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.erfc (tt::ttnn::ErfcOp)

Eltwise erfc op.

Eltwise erfc operation. Calculates erfc(x) for each element of the input tensor.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.exp (tt::ttnn::ExpOp)

Eltwise exponential.

Eltwise exponential operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.expm1 (tt::ttnn::Expm1Op)

Eltwise unary op.

Performs element-wise exponential minus one operation on operand tensor and stores the result in the output tensor.

Example: %a: [[0, 1], [0, 0]] "ttnn.expm1"(%a, %out) -> %out: [[0, 1.71828], [0, 0]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.fill_cache (tt::ttnn::FillCacheOp)

Fill static cache tensor.

Fills the cache tensor in-place with values from input at batch_offset.
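
A hedged sketch of typical usage (the cache and input shapes, which would normally come from a KV-cache update, are illustrative):

```mlir
// Overwrite the cache contents for batch 0 with the new values in %input.
"ttnn.fill_cache"(%cache, %input) <{batch_offset = 0 : i32}>
    : (tensor<1x8x1024x64xbf16>, tensor<1x8x128x64xbf16>) -> ()
```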

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `batch_offset` | ::mlir::IntegerAttr | 32-bit signless integer attribute |

Operands:

OperandDescription
cacheranked tensor of any type values
inputranked tensor of any type values

ttnn.floor (tt::ttnn::FloorOp)

Eltwise floor op.

Eltwise floor operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.from_device (tt::ttnn::FromDeviceOp)

FromDevice op.

This op retrieves the input tensor from the given device.
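
A minimal sketch, assuming the tensor keeps its logical shape and element type when moved back to host memory:

```mlir
// Copy %dev_tensor from device memory back to system memory on the host.
%host_tensor = "ttnn.from_device"(%dev_tensor) : (tensor<64x128xbf16>) -> tensor<64x128xbf16>
```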

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.full (tt::ttnn::FullOp)

Creates a tensor filled with the specified value

Tensor operation to create a tensor filled with a specified value.

Given a shape and a fill_value, produces a tensor with the shape, filled with the specified value.

Example: %0 = "ttnn.full"() <{ dtype = #tt.supportedDataTypes, fill_value = 3 : i32, layout = #ttnn.layout, shape = #ttnn.shape<64x128> }> : () -> tensor<64x128xui32> // %0: [[[7, 7, 7, ..., 7], [7, 7, 7, ..., 7], ..., [7, 7, 7, ..., 7]]]

Traits: AlwaysSpeculatableImplTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `shape` | ::mlir::tt::ttnn::ShapeAttr | TTNN Shape attribute |
| `fill_value` | ::mlir::Attribute | 32-bit float attribute or 32-bit signless integer attribute |
| `dtype` | ::mlir::tt::DataTypeAttr | TT DataTypes |
| `layout` | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.gelu (tt::ttnn::GeluOp)

Eltwise GELU.

Eltwise GELU operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.get_device (tt::ttnn::GetDeviceOp)

Get Device op.

This op returns a submesh carved out from the parent runtime device. Mesh shape and mesh offset define the size and offset of the submesh.
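
A hedged sketch; the attribute literal syntax and the device type spelling are illustrative assumptions:

```mlir
// Obtain a 1x1 submesh starting at offset (0, 0) of the parent mesh.
%device = "ttnn.get_device"()
    <{mesh_shape = #ttnn<mesh_shape 1x1>, mesh_offset = #ttnn<mesh_offset 0x0>}>
    : () -> !ttnn.device
```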

Traits: AlwaysSpeculatableImplTrait, TT_DuplicateConstEvalTrait

Interfaces: ConditionallySpeculatable, InferTypeOpInterface, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `mesh_shape` | ::mlir::tt::ttnn::MeshShapeAttr | TTNN Mesh Shape: mesh shape representing the dimensions of a 2D mesh |
| `mesh_offset` | ::mlir::tt::ttnn::MeshOffsetAttr | TTNN Mesh Offset: mesh offset representing the starting coordinates in a 2D mesh |

Results:

ResultDescription
deviceTTNN device

ttnn.ge (tt::ttnn::GreaterEqualOp)

Eltwise greater than or equal to.

Eltwise greater than or equal to operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.gt (tt::ttnn::GreaterThanOp)

Eltwise greater than.

Eltwise greater than operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.isfinite (tt::ttnn::IsFiniteOp)

Eltwise isfinite op.

Eltwise isfinite operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.leaky_relu (tt::ttnn::LeakyReluOp)

Eltwise leaky relu operation.

The Leaky ReLU (Rectified Linear Unit) operation computes an element-wise activation function over its input tensor. It is defined as:

y = x               if x > 0
y = parameter * x   if x <= 0

where `parameter` is a small, user-defined constant that determines the slope for negative inputs. An IR sketch is given after the lists below.

Attributes:

  • parameter (float): The slope for negative values.

Inputs:

  • input (Tensor): The input tensor to be activated.

Outputs:

  • output (Tensor): The tensor after applying the Leaky ReLU activation.
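
A minimal sketch of the op in IR form; the shape, element type, and slope value are illustrative:

```mlir
// Leaky ReLU with a negative-input slope of 0.01.
%activated = "ttnn.leaky_relu"(%input) <{parameter = 0.01 : f32}>
    : (tensor<64x128xbf16>) -> tensor<64x128xbf16>
```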

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `parameter` | ::mlir::FloatAttr | 32-bit float attribute |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.le (tt::ttnn::LessEqualOp)

Eltwise less than or equal to.

Eltwise less than or equal to operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.lt (tt::ttnn::LessThanOp)

Eltwise less than.

Eltwise less than operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.linear (tt::ttnn::LinearOp)

Linear transformation of inputs.

Produces the matmul of tensors a and b with optional addition with bias.

Example:
// %a = [[1., 2.], [2., 1.]]
// %b = [[0., 1.], [1., 0.]]
// %bias = [[1.]]
"ttnn.linear"(%a, %b, %bias, %result) : (tensor<2x2xf16>, tensor<2x2xf16>, tensor<1xf16>, tensor<2x2xf16>) -> tensor<2x2xf16>
// %result = [[3., 2.], [2., 3.]]

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `transpose_a` | ::mlir::BoolAttr | bool attribute |
| `transpose_b` | ::mlir::BoolAttr | bool attribute |

Operands:

OperandDescription
aranked tensor of any type values
branked tensor of any type values
biasranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.log1p (tt::ttnn::Log1pOp)

Eltwise log1p operation.

Performs an element-wise log(1 + x) operation on the operand tensor and stores the result in the output tensor.

Example: %a: [0.0, -0.999, 7.0, 6.38905621, 15.0] "ttnn.log1p"(%a, %out) -> %out: [0.0, -6.90776825, 2.07944155, 2.0, 2.77258873]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.log (tt::ttnn::LogOp)

Eltwise logarithm.

Eltwise logarithm operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.logical_and (tt::ttnn::LogicalAndOp)

Eltwise logical and.

Eltwise logical and operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.logical_not (tt::ttnn::LogicalNotOp)

Eltwise logical not op.

Eltwise logical not operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.logical_or (tt::ttnn::LogicalOrOp)

Eltwise logical or.

Eltwise logical or operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.logical_xor (tt::ttnn::LogicalXorOp)

Eltwise logical xor.

Eltwise logical xor operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.matmul (tt::ttnn::MatmulOp)

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `transpose_a` | ::mlir::BoolAttr | bool attribute |
| `transpose_b` | ::mlir::BoolAttr | bool attribute |
| `matmul_program_config` | ::mlir::Attribute | TTNN MatmulMultiCoreReuseProgramConfig or TTNN MatmulMultiCoreReuseMultiCastProgramConfig or TTNN MatmulMultiCoreReuseMultiCast1DProgramConfig or TTNN MatmulMultiCoreReuseMultiCastDRAMShardedProgramConfig |

Operands:

OperandDescription
aranked tensor of any type values
branked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.max (tt::ttnn::MaxOp)

Max reduction op.

Max reduction op.
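
An illustrative example, in the same style as the ttnn.min documentation later in this section:

Example: input: [[1, 5, 3], [4, 2, 6]]

// Computing along dim 0 output: [4, 5, 6]

// Computing along dim 1 output: [5, 6]

// Computing for entire tensor output: 6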

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `keep_dim` | ::mlir::BoolAttr | bool attribute |
| `dim_arg` | ::mlir::ArrayAttr | 32-bit integer array attribute |

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.max_pool2d (tt::ttnn::MaxPool2dOp)

Applies a 2D max pooling over an input signal composed of several input planes.

Applies a 2D max pooling over an input signal composed of several input planes.
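
A hedged sketch of typical usage; the attribute values, the flattened NHWC input shape, and the padding layout are illustrative assumptions:

```mlir
// 2x2 max pooling with stride 2 over a 1x32x32x64 input flattened to
// (1, 1, H*W, C), producing an output flattened the same way.
%pooled = "ttnn.max_pool2d"(%input) <{
  batch_size = 1 : si32, input_height = 32 : si32, input_width = 32 : si32,
  channels = 64 : si32,
  kernel_size = array<i32: 2, 2>, stride = array<i32: 2, 2>,
  padding = array<i32: 0, 0>, dilation = array<i32: 1, 1>,
  ceil_mode = false, in_place_halo = false
}> : (tensor<1x1x1024x64xbf16>) -> tensor<1x1x256x64xbf16>
```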

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `batch_size` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `input_height` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `input_width` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `channels` | ::mlir::IntegerAttr | 32-bit signed integer attribute |
| `kernel_size` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
| `stride` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
| `padding` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
| `dilation` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `applied_shard_scheme` | ::mlir::tt::ttnn::TensorMemoryLayoutAttr | TTNN Tensor Memory Layout |
| `ceil_mode` | ::mlir::BoolAttr | bool attribute |
| `in_place_halo` | ::mlir::BoolAttr | bool attribute |

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.maximum (tt::ttnn::MaximumOp)

Eltwise maximum OP.

Calculates maximum of input tensors' values element-wise and stores result in output tensor.

Example: %lhs: [[3, 2, 7], [1, 4, 4]] %rhs: [[1, 4, 2], [1, 2, 3]] "ttnn.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.mean (tt::ttnn::MeanOp)

Mean reduction op.

Mean reduction op.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `keep_dim` | ::mlir::BoolAttr | bool attribute |
| `dim_arg` | ::mlir::ArrayAttr | 32-bit integer array attribute |

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.mesh_shard (tt::ttnn::MeshShardOp)

Mesh shard op.

Tensor Mesh Shard operation

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `shard_direction` | ::mlir::tt::MeshShardDirectionAttr | TT MeshShardDirection |
| `shard_type` | ::mlir::tt::MeshShardTypeAttr | MeshShard shard_type attribute in the TT dialect |
| `shard_shape` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| `shard_dims` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |

{{% markdown %}} The `shard_type` attribute defines how the tensor data of the mesh_shard op is sharded:
- Identity: input and output tensors are pre-sharded (same data) and no sharding is required.
- Replicate: all of the devices have the full tensor (same data).
- Maximal: one device or a subset of the devices has the full tensor (same data).
- Devices: all or a subset of the devices have a sharded (partial) tensor (different data).
{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.min (tt::ttnn::MinOp)

Min reduction op.

This op computes the minimum of all elements of the tensor or along specified dimension.

Example: input: [[1, 5, 3], [4, 2, 6]]

// Computing along dim 0 output: [1, 2, 3]

// Computing along dim 1 output: [1, 2]

// Computing for entire tensor output: 1

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `keep_dim` | ::mlir::BoolAttr | bool attribute |
| `dim_arg` | ::mlir::ArrayAttr | 32-bit integer array attribute |

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.minimum (tt::ttnn::MinimumOp)

Eltwise minimum OP.

Calculates minimum of input tensors' values element-wise and stores result in output tensor.

Example: %lhs: [[3, 2, 7], [1, 4, 4]] %rhs: [[1, 4, 2], [1, 2, 3]] "ttnn.minimum"(%lhs, %rhs, %out) -> %out: [[1, 2, 2], [1, 2, 3]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.moreh_cumsum (tt::ttnn::MorehCumSumOp)

Moreh cumulative sum op.

Computes the cumulative sum of elements of a tensor along specified dimension.

Example: input: [[1, 2, 3], [4, 5, 6]]

// Cumulative sum along dim=0: output: [[1, 2, 3], [5, 7, 9]]

// Cumulative sum along dim=1: output: [[1, 3, 6], [4, 9, 15]]

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `dim` | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.multiply (tt::ttnn::MultiplyOp)

Eltwise multiply.

Eltwise multiply operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.neg (tt::ttnn::NegOp)

Eltwise negate.

Eltwise negate operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.ne (tt::ttnn::NotEqualOp)

Eltwise not equal to.

Eltwise not equal to operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

AttributeMLIR TypeDescription
memory_config::mlir::tt::ttnn::MemoryConfigAttr
TTNN MemoryConfig attribute{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
lhsranked tensor of any type values
rhsranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.ones (tt::ttnn::OnesOp)

Creates a tensor filled with ones.

Tensor operation to create a tensor filled with ones.

Given a ShapeAttr shape, produces a tensor with the same shape, filled with ones.

Example: %0 = "ttnn.ones"() <{shape = array<i32:64, 28, 28>}> : () -> tensor<64x28x28xbf16> // %0: [[[1, 1, 1, ..., 1], [1, 1, 1, ..., 1], ..., [1, 1, 1, ..., 1]]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `shape` | ::mlir::tt::ttnn::ShapeAttr | TTNN Shape attribute |
| `dtype` | ::mlir::tt::DataTypeAttr | TT DataTypes |
| `layout` | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
deviceTTNN device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.pad (tt::ttnn::PadOp)

Pad op.

Pad input tensor by padding the input_shape to output_shape using the provided value.

The padding attribute must be a sequence of integers whose length is twice the rank of the input. Each pair of integers in the padding attribute represents the amount of padding to add to the low and high side of that dimension. For example, an input tensor of shape <1x30x30x64xf32> with padding attribute <0, 0, 1, 1, 1, 1, 0, 0> will return a tensor of shape <1x32x32x64xf32>, and so will a padding attribute of <0, 0, 0, 2, 0, 2, 0, 0>.
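
A minimal sketch of the first example above in IR form (the generic-op syntax is illustrative):

```mlir
// Pad the two spatial dimensions of a <1x30x30x64xf32> input by one element
// on each side, filling with 0.0, to produce a <1x32x32x64xf32> result.
%padded = "ttnn.pad"(%input)
    <{padding = array<i32: 0, 0, 1, 1, 1, 1, 0, 0>, value = 0.0 : f32, use_multicore = true}>
    : (tensor<1x30x30x64xf32>) -> tensor<1x32x32x64xf32>
```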

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `padding` | ::mlir::DenseI32ArrayAttr | i32 dense array attribute |
| `value` | ::mlir::FloatAttr | 32-bit float attribute |
| `use_multicore` | ::mlir::BoolAttr | bool attribute |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

OperandDescription
inputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.permute (tt::ttnn::PermuteOp)

Permute operation.

Permute input tensor dimensions.

Attributes:

  • permutation array: The permutation of the input tensor dimensions.

Example: %a = ttir.empty() : () -> tensor<2x3x4xi32> %0 = "ttir.permute"(%a) {permutation = array<i64: 1, 2, 0>} : (tensor<2x3x4xi32>) -> tensor<3x4x2xi32>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

| Attribute | MLIR Type | Description |
| --------- | --------- | ----------- |
| `permutation` | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
| `memory_config` | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

  • pad_value (::mlir::FloatAttr): 32-bit float attribute

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.pow (tt::ttnn::PowOp)

Eltwise power OP.

Performs element-wise exponentiation of lhs tensor by rhs tensor and produces a result tensor. Tensors must be of same shape.

Example:

  %result = "ttnn.pow"(%lhs, %rhs) : (tensor<6xf64>, tensor<6xf64>) -> tensor<6xf64>

  %lhs: [-2.0, -0.0, -36.0, 5.0, 3.0, 10000.0]
  %rhs: [2.0, 2.0, 1.1, 2.0, -1.0, 10.0]
  %result: [4.0, 0.0, -nan, 25.0, 0.333333343, inf]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.prepare_conv2d_weights (tt::ttnn::PrepareConv2dWeightsOp)

Prepares conv2d weights so that they can be consumed by the conv2d op.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • input_memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

  • input_tensor_layout (::mlir::tt::ttnn::LayoutAttr): TTNN Layout
  • weights_format (::mlir::StringAttr): string attribute
  • in_channels (::mlir::IntegerAttr): 32-bit signless integer attribute
  • out_channels (::mlir::IntegerAttr): 32-bit signless integer attribute
  • batch_size (::mlir::IntegerAttr): 32-bit signless integer attribute
  • input_height (::mlir::IntegerAttr): 32-bit signless integer attribute
  • input_width (::mlir::IntegerAttr): 32-bit signless integer attribute
  • kernel_size (::mlir::DenseI32ArrayAttr): i32 dense array attribute
  • stride (::mlir::DenseI32ArrayAttr): i32 dense array attribute
  • padding (::mlir::DenseI32ArrayAttr): i32 dense array attribute
  • dilation (::mlir::DenseI32ArrayAttr): i32 dense array attribute
  • has_bias (::mlir::BoolAttr): bool attribute
  • groups (::mlir::IntegerAttr): 32-bit signless integer attribute
  • conv2d_config (::mlir::tt::ttnn::Conv2dConfigAttr): TTNN Conv2dConfig attribute {{% markdown %}} TTNN conv2d config attribute {{% /markdown %}}

Operands:

  • weight_tensor: ranked tensor of any type values
  • device: TTNN device

Results:

  • result: ranked tensor of any type values

ttnn.prod (tt::ttnn::ProdOp)

Product reduction op.

This op computes the product of all elements of the tensor (full product) or along a specific dimension.

Example:

  input: [[1, 2, 3], [4, 5, 6]]

  // Computing along dim 0
  output: [4, 10, 18]

  // Computing along dim 1
  output: [6, 120]

  // Computing full product
  output: 720

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • dim_arg (::mlir::IntegerAttr): 64-bit signless integer attribute
  • keep_dim (::mlir::BoolAttr): bool attribute
  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.quantize (tt::ttnn::QuantizeOp)

Quantize operation.

Applies quantization to the input tensor.

Inputs:

  • input AnyRankedTensor: The input tensor to be quantized. Must have floating-point element type.
  • scale AnyRankedTensor: The scale factor (or factors for per-axis quantization). Must be either a scalar (for per-tensor quantization) or a 1D tensor with size matching the dimension of the specified axis (for per-axis quantization).
  • zero_point AnyRankedTensor: The zero point value (or values for per-axis quantization). Must be in range of the quantized storage type.
  • axis Optional: The axis along which quantization is applied. Must be in range [0, rank) where rank is the rank of the input tensor.
  • output_dtype Optional<TT_DataTypeAttr>: The data type of the output tensor.
  • memory_config Optional<TTNN_MemoryConfigAttr>: The memory configuration for the output tensor.

  // For per-tensor quantization:
  output[i] = round(input[i] / scale) + zero_point

  // For per-axis quantization:
  output[i0, i1, ..., ia, ..., in] = round(input[i0, i1, ..., ia, ..., in] / scale[ia]) + zero_point[ia]

Example:

  %input = ttir.empty() : () -> tensor<64x128xf32>
  %output = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>
  %quantized = "ttir.quantize"(%input, %output) : (tensor<64x128xf32>, tensor<64x128x!quant.uniform<i32:f32, 0.1>>) -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • axis (::mlir::IntegerAttr): 32-bit signless integer attribute
  • output_dtype (::mlir::tt::DataTypeAttr): TT DataTypes
  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values
  • scale: ranked tensor of any type values
  • zero_point: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.reciprocal (tt::ttnn::ReciprocalOp)

Eltwise reciprocal.

Eltwise reciprocal operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.reduce_scatter (tt::ttnn::ReduceScatterOp)

Reduce scatter op.

Tensor Reduce Scatter operation

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • reduce_type (::mlir::tt::ReduceTypeAttr): TT Reduce Type
  • scatter_dim (::mlir::IntegerAttr): 32-bit signed integer attribute
  • cluster_axis (::mlir::IntegerAttr): 32-bit unsigned integer attribute
  • num_links (::mlir::IntegerAttr): 32-bit unsigned integer attribute

Operands:

  • input: ranked tensor of any type values
  • device: TTNN device

Results:

  • result: ranked tensor of any type values

ttnn.relu (tt::ttnn::ReluOp)

Eltwise ReLU.

Eltwise ReLU operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.remainder (tt::ttnn::RemainderOp)

Eltwise remainder.

Performs element-wise remainder of dividend lhs and divisor rhs tensors and produces a result tensor.

Example:

  // %lhs: [17, -17, 17, -17]
  // %rhs: [3, 3, -3, -3]
  %result = "ttnn.remainder"(%lhs, %rhs) : (tensor<4xi64>, tensor<4xi64>) -> tensor<4xi64>
  // %result: [2, -2, 2, -2]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.repeat_interleave (tt::ttnn::RepeatInterleaveOp)

Repeat interleave op.

Repeats elements of a tensor along a specified dimension. It allows for flexible repetition patterns, where each element can be repeated a different number of times. This is particularly useful for tasks that require duplicating elements in a non-uniform manner.

Parameters:

  • input: The input tensor.
  • repeats: Specifies the number of repetitions for each element, each element is repeated that number of times.
  • dim: The dimension along which to repeat values.
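
A minimal sketch of how these parameters might appear in IR, using generic syntax; the shapes and attribute values are illustrative assumptions:

```mlir
// Hypothetical example: repeat each row of a 4x3 tensor twice along dim 0,
// producing an 8x3 tensor.
%0 = "ttnn.repeat_interleave"(%input) <{
  repeats = 2 : ui32,
  dim = 0 : si32
}> : (tensor<4x3xbf16>) -> tensor<8x3xbf16>
```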

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • repeats (::mlir::IntegerAttr): 32-bit unsigned integer attribute
  • dim (::mlir::IntegerAttr): 32-bit signed integer attribute
  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.repeat (tt::ttnn::RepeatOp)

Repeat op.

Returns a new tensor filled with repetition of input tensor according to number of times specified in repeat_dims.

Parameters:

  • input_tensor (ttnn.Tensor): the input tensor.
  • repeat_dims (number): The number of repetitions for each element.
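
A minimal sketch of how this might look in IR, assuming the `#ttnn.shape` spelling for the repeat_dims attribute; shapes and values are illustrative:

```mlir
// Hypothetical example: repeat a 1x32 tensor 4 times along dim 0.
%0 = "ttnn.repeat"(%input) <{
  repeat_dims = #ttnn.shape<4x1>
}> : (tensor<1x32xbf16>) -> tensor<4x32xbf16>
```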

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • repeat_dims (::mlir::tt::ttnn::ShapeAttr): TTNN Shape attribute {{% markdown %}} TTNN shape attribute {{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.requantize (tt::ttnn::RequantizeOp)

Requantize operation.

Applies requantization to the input tensor.

Inputs:

  • input AnyRankedTensor: The input tensor to be requantized. Must have quantized element type.
  • in_scale AnyRankedTensor: The input scale factor (or factors for per-axis quantization). Must be either a scalar (for per-tensor quantization) or a 1D tensor with size matching the dimension of the specified axis (for per-axis quantization).
  • in_zero_point AnyRankedTensor: The input zero point value (or values for per-axis quantization). Must be in range of the quantized storage type.
  • out_scale AnyRankedTensor: The output scale factor (or factors for per-axis quantization). Must be either a scalar (for per-tensor quantization) or a 1D tensor with size matching the dimension of the specified axis (for per-axis quantization).
  • out_zero_point AnyRankedTensor: The output zero point value (or values for per-axis quantization). Must be in range of the quantized storage type.
  • axis Optional: The axis along which quantization is applied. Must be in range [0, rank) where rank is the rank of the input tensor.
  • output_dtype Optional<TT_DataTypeAttr>: The data type of the output tensor.
  • memory_config Optional<TTNN_MemoryConfigAttr>: The memory configuration for the output tensor.

  // For per-tensor requantization:
  output[i] = round((input[i] - input_zero_point) * (input_scale / output_scale)) + output_zero_point

  // For per-axis requantization:
  output[i0, i1, ..., ia, ..., in] = round((input[i0, i1, ..., ia, ..., in] - in_zero_point[ia]) * (in_scale[ia] / out_scale[ia])) + out_zero_point[ia]

Example:

  %input = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.1>>
  %output = ttir.empty() : () -> tensor<64x128x!quant.uniform<i32:f32, 0.2>>
  %requantized = "ttnn.requantize"(%input, %output) : (tensor<64x128x!quant.uniform<i32:f32, 0.1>>, tensor<64x128x!quant.uniform<i32:f32, 0.2>>) -> tensor<64x128x!quant.uniform<i32:f32, 0.2>>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • axis (::mlir::IntegerAttr): 32-bit signless integer attribute
  • output_dtype (::mlir::tt::DataTypeAttr): TT DataTypes
  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values
  • in_scale: ranked tensor of any type values
  • in_zero_point: ranked tensor of any type values
  • out_scale: ranked tensor of any type values
  • out_zero_point: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.reshape (tt::ttnn::ReshapeOp)

Reshape op.

Reshape tensor.
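
A minimal sketch, assuming generic MLIR syntax and an illustrative shape attribute (not taken from a real test case):

```mlir
// Hypothetical example: flatten a 2x3x4 tensor into a 6x4 tensor.
%0 = "ttnn.reshape"(%input) <{
  shape = [6 : i32, 4 : i32]
}> : (tensor<2x3x4xbf16>) -> tensor<6x4xbf16>
```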

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • shape (::mlir::ArrayAttr): 32-bit integer array attribute
  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.rsqrt (tt::ttnn::RsqrtOp)

Eltwise rsqrt.

Eltwise rsqrt operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.scatter (tt::ttnn::ScatterOp)

Scatter op.

Embeds the values of the 'update' tensor into 'input' at the given index and puts the value in the 'output' tensor.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.sigmoid (tt::ttnn::SigmoidOp)

Eltwise sigmoid.

Eltwise sigmoid operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.sign (tt::ttnn::SignOp)

Eltwise sign operation.

Returns the sign of the operand element-wise and produces a result tensor.

Example:

  // %a: [[3, -2, 0], [1, -4, 4]]
  "ttnn.sign"(%a, %out) -> %out: [[1, -1, 0], [1, -1, 1]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.sin (tt::ttnn::SinOp)

Eltwise sine.

Eltwise sine operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.slice (tt::ttnn::SliceOp)

Slice op.

Extract a portion of a tensor based on the specified start (begins), stop (ends), and step indices for each dimension.
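
A minimal sketch of the begins/ends/step attributes in generic MLIR syntax; the shapes and indices are illustrative assumptions:

```mlir
// Hypothetical example: take rows 0..63 and every other column of a
// 128x128 tensor, yielding a 64x64 tensor.
%0 = "ttnn.slice"(%input) <{
  begins = [0 : i32, 0 : i32],
  ends = [64 : i32, 128 : i32],
  step = [1 : i32, 2 : i32]
}> : (tensor<128x128xbf16>) -> tensor<64x64xbf16>
```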

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • begins (::mlir::ArrayAttr): 32-bit integer array attribute
  • ends (::mlir::ArrayAttr): 32-bit integer array attribute
  • step (::mlir::ArrayAttr): 32-bit integer array attribute

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.softmax (tt::ttnn::SoftmaxOp)

Softmax op.

Softmax operation.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • dimension (::mlir::IntegerAttr): 32-bit signed integer attribute

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.sqrt (tt::ttnn::SqrtOp)

Eltwise sqrt.

Eltwise sqrt operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.subtract (tt::ttnn::SubtractOp)

Eltwise subtract.

Eltwise subtract operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • lhs: ranked tensor of any type values
  • rhs: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.sum (tt::ttnn::SumOp)

Sum reduction op.

Sum reduction op.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • keep_dim (::mlir::BoolAttr): bool attribute
  • dim_arg (::mlir::ArrayAttr): 32-bit integer array attribute

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.tan (tt::ttnn::TanOp)

Eltwise tan op.

Eltwise tan operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.tanh (tt::ttnn::TanhOp)

Eltwise tanh op.

Eltwise tanh operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.to_dtype (tt::ttnn::ToDTypeOp)

ToDType op.

This op converts the data type of the input tensor based on the given data type on the host.

Args:

  • input: the ttnn.Tensor
  • dtype: ttnn data type.
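
A minimal sketch, assuming the `#tt.supportedDataTypes<...>` spelling for the TT DataTypes attribute; the attribute form, shapes, and element types are assumptions for illustration:

```mlir
// Hypothetical example: convert a bf16 host tensor to f32 on the host.
%0 = "ttnn.to_dtype"(%input) <{
  dtype = #tt.supportedDataTypes<f32>
}> : (tensor<64x128xbf16>) -> tensor<64x128xf32>
```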

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • dtype (::mlir::tt::DataTypeAttr): TT DataTypes

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.to_device (tt::ttnn::ToDeviceOp)

ToDevice op.

This op sends the input tensor to the given device with the given memory config.
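
A minimal sketch, reusing the `#ttnn.memory_config` form from the attribute examples on this page; the `#dram` alias, the `!ttnn.device` type spelling, and the tensor shape are illustrative assumptions:

```mlir
// Hypothetical example: move a host tensor onto the device as DRAM-interleaved.
%on_device = "ttnn.to_device"(%input, %device) <{
  memory_config = #ttnn.memory_config<#dram, <interleaved>>
}> : (tensor<64x128xbf16>, !ttnn.device) -> tensor<64x128xbf16>
```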

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values
  • device: TTNN device

Results:

  • result: ranked tensor of any type values

ttnn.to_layout (tt::ttnn::ToLayoutOp)

ToLayout op.

This op wraps all layout information gathered from ttir.toLayout. It is used/updated by the optimizer to perform optimizations, and later broken down into specific memory/layout operations (toDevice, toMemoryConfig etc.). Currently in the TTNN backend, we use this op solely for tilize/untilize, therefore marking all other attrs as optional. Once ttnn::to_layout supports other attrs, we can remove the optional tag.
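
A minimal sketch of the tilize use case, assuming the `#ttnn.layout<tile>` spelling for the layout attribute; shapes and attribute forms are illustrative:

```mlir
// Hypothetical example: convert a row-major tensor to the tile layout.
%tilized = "ttnn.to_layout"(%input) <{
  layout = #ttnn.layout<tile>
}> : (tensor<64x128xbf16>) -> tensor<64x128xbf16>
```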

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • layout (::mlir::tt::ttnn::LayoutAttr): TTNN Layout
  • dtype (::mlir::tt::DataTypeAttr): TT DataTypes
  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values
  • device: TTNN device

Results:

  • result: ranked tensor of any type values

ttnn.to_memory_config (tt::ttnn::ToMemoryConfigOp)

ToMemoryConfig op.

This op converts the memory config of the input tensor based on the given memory config. It handles:

  • Dram to L1
  • L1 to Dram
  • Interleaved to sharded
  • Sharded to interleaved
  • Sharded to sharded (reshard)
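
A minimal sketch of an interleaved-to-sharded conversion, reusing the `#ttnn.memory_config` example from the attribute description below; the tensor shape and the `#l1` alias are illustrative assumptions:

```mlir
// Hypothetical example: reshard a DRAM-interleaved 256x128 tensor into L1 with
// block sharding over the core range (0,0)-(7,0), one 32x128 shard per core.
%resharded = "ttnn.to_memory_config"(%input) <{
  memory_config = #ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
}> : (tensor<256x128xbf16>) -> tensor<256x128xbf16>
```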

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.transpose (tt::ttnn::TransposeOp)

Transpose op.

Transpose tensor along two given dimensions.
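
A minimal sketch, with illustrative dimensions and generic syntax:

```mlir
// Hypothetical example: swap dims 1 and 2 of a 32x64x128 tensor.
%0 = "ttnn.transpose"(%input) <{
  dim0 = 1 : si32,
  dim1 = 2 : si32
}> : (tensor<32x64x128xbf16>) -> tensor<32x128x64xbf16>
```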

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • dim0 (::mlir::IntegerAttr): 32-bit signed integer attribute
  • dim1 (::mlir::IntegerAttr): 32-bit signed integer attribute

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.typecast (tt::ttnn::TypecastOp)

Typecast op.

This op converts the data type of the input tensor based on the given data type. It handles:

  • conversions of data types.

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • dtype (::mlir::tt::DataTypeAttr): TT DataTypes

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.update_cache (tt::ttnn::UpdateCacheOp)

Update static cache tensor.

Updates the cache tensor in-place with values from input at update_index and batch_offset.
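
A minimal sketch, assuming a KV-cache-style use with illustrative shapes; the op has no results, since it writes into the cache operand in place:

```mlir
// Hypothetical example: write the current-step values from %input into the
// static cache at the position held in %update_index.
"ttnn.update_cache"(%cache, %input, %update_index) <{
  batch_offset = 0 : i32
}> : (tensor<1x8x64x128xbf16>, tensor<1x8x1x128xbf16>, tensor<1xi32>) -> ()
```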

Interfaces: MemoryEffectOpInterface (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{MemoryEffects::Write on ::mlir::SideEffects::DefaultResource}

Attributes:

  • batch_offset (::mlir::IntegerAttr): 32-bit signless integer attribute

Operands:

  • cache: ranked tensor of any type values
  • input: ranked tensor of any type values
  • update_index: ranked tensor of any type values

ttnn.upsample (tt::ttnn::UpsampleOp)

Upsample 2D op.

Upsample 2D operation. Input tensor is assumed to be in NHWC format.

Attributes:

  • scale_factor (si32 | array): The scale factor for upsampling in H and W dimensions respectively.
  • mode (str): The upsampling algorithm. Currently only "nearest" and "bilinear" are supported. Default is "nearest".

Example:

  // %a: tensor<10x64x32x3xbf16>
  %0 = "ttnn.upsample"(%a) <{scale_factor = array<i32: 2, 4>}> : (tensor<10x64x32x3xbf16>) -> tensor<10x128x128x3xbf16>

Traits: AlwaysSpeculatableImplTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), OpModel, TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • scale_factor (::mlir::Attribute): 32-bit signed integer attribute or i32 dense array attribute
  • mode (::mlir::StringAttr): string attribute
  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • input: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.where (tt::ttnn::WhereOp)

Eltwise where.

Eltwise where operation.

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • first: ranked tensor of any type values
  • second: ranked tensor of any type values
  • third: ranked tensor of any type values

Results:

  • result: ranked tensor of any type values

ttnn.zeros (tt::ttnn::ZerosOp)

Creates a tensor filled with zeros.

Tensor operation to create a tensor filled with zeros.

Given a ShapeAttr shape, produces a tensor with the same shape, filled with zeros.

Example:

  %0 = "ttnn.zeros"() <{shape = array<i32: 64, 28, 28>}> : () -> tensor<64x28x28xbf16>
  // %0: [[[0, 0, 0, ..., 0], [0, 0, 0, ..., 0], ..., [0, 0, 0, ..., 0]]]

Traits: AlwaysSpeculatableImplTrait, HasMemoryConfigTrait, TT_CreationOpTrait

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), TTNN_WorkaroundInterface

Effects: MemoryEffects::Effect{}

Attributes:

  • shape (::mlir::tt::ttnn::ShapeAttr): TTNN Shape attribute {{% markdown %}} TTNN shape attribute {{% /markdown %}}
  • dtype (::mlir::tt::DataTypeAttr): TT DataTypes
  • layout (::mlir::tt::ttnn::LayoutAttr): TTNN Layout
  • memory_config (::mlir::tt::ttnn::MemoryConfigAttr): TTNN MemoryConfig attribute

{{% markdown %}} The `MemoryConfigAttr` defines how a tensor is stored in memory on Tenstorrent hardware.
This attribute specifies:
- `bufferType` - specifies which memory type to use (L1, DRAM, System Memory).
- `tensorMemoryLayout` - defines how the tensor is laid out in memory (interleaved, single_bank, block_sharded, width_sharded, height_sharded)
- `shardSpec` - optional parameter is only used with sharded memory layouts and defines how the tensor is distributed across multiple cores.

Examples:
```mlir
// Simple interleaved memory in DRAM
#ttnn.memory_config<#dram, <interleaved>>

// L1 memory with block sharding across cores
#ttnn.memory_config<#l1, <block_sharded>, #ttnn.shard_spec<#ttnn.core_range_set<[#ttnn.core_range<(0,0), (7, 0)>]>, <32x128>, <row_major>, <physical>>>
```

{{% /markdown %}}

Operands:

  • device: TTNN device

Results:

  • result: ranked tensor of any type values