Introduction

The following document provides an overview of the TT-MLIR project, with a focus on the technical specifications of an MLIR-based compiler stack. So what exactly is an MLIR-based compiler stack? MLIR (Multi-Level Intermediate Representation) is a subproject coming out of the LLVM Project. It introduces extensibility and sustainable code design to a modular compiler framework. In essence, it takes a much larger, more involved compiler (like LLVM) and splits it into sub-compilers that each produce their own Intermediate Representation (IR) of what you've fed the compiler.

Disclaimer: This is intended to be a working document; if you find something incorrect or incomplete, please feel free to create a PR.

Motivations

The idea of having a multi-level IR might not seem so far-fetched; in fact, it resembles some of our current software stacks. Going from a high-level TVM graph → lowered PyBUDA graph → netlist, with each layer having its own level of optimizations, is a familiar concept. However, there are problems with the reusability and integration of optimizations in the current software compiler stack. Currently, users are almost forced to choose between top-down and bottom-up optimization, with both requiring expert-level knowledge to reach the desired performance. Developing two entirely different projects is taxing, and it's hard to translate the benefits of BUDA over to metal (or the other way around). One of the primary goals of tt-mlir is to enable a consistent programming model between software stacks: concepts for improving optimizations in the compiler stack should carry over 1:1 to hand-written TTNN.

The benefits grow even further when one considers all the possible entry points that multiple IRs present. Existing MLIR-based projects like OpenXLA and torch-mlir can natively output MLIR in a dialect that can be translated to the TTIR dialect as well!

What is MLIR and why use it?

MLIR is a compiler infrastructure that is designed to be modular and extensible. The main benefits the tt-mlir project hopes to gain by using MLIR include:

  • Industry Standard Compiler Framework
    • Lots of boilerplate algorithms, data structures, and useful software that is common to compiler development
  • Ecosystem
    • Hook into existing front-end MLIR projects
  • Testing framework
    • A battle-tested test infrastructure that will enable us to write fine grained tests and rely less on end-to-end testing
    • Common IR Serialization Format that's easy to test, debug, and edit

Additional documentation to highlight the benefits of MLIR can be found here:

MLIR: Overview

MLIR is, at its root, an infrastructure that can parse "readable" text in some .mlir format. The unique properties lie in the modularity of the parsing itself. MLIR is built upon a collection of Dialects, and each of these Dialects defines a collection of Operations, Types, and Attributes. These dialects follow their own syntax, and they can encode any amount of information. The benefit is that MLIR provides bindings and hooks such that a user can directly translate these IRs into usable artifacts for that layer of complexity. An example of this would be the relatively high-level TOSA Dialect, which is used to represent computation over tensors, and then lowering that to a more hardware-specific dialect that closely models the programming model of the hardware or underlying backend. It is the dialect system itself that powers the multi-level functionality of MLIR: with different dialects, a user can essentially "lower" through their software stack by transforming between the dialects for each layer. Dialects range from purely mathematical dialects, to a LinAlg Dialect, to a TensorFlow Dialect defined for ML graphs. Each dialect encodes its own information, and its operations can use the Types/Attributes of other dialects as parameters. Multiple dialects can coexist in one module, and this is encouraged to highlight the optimizations of different dialects. In our use case for the TT stack, MLIR acts as a "mid-level" compiler which makes the task of joining together various entry points and backends much simpler.
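
To make this concrete, here is a minimal sketch (illustrative only, not taken from the tt-mlir codebase) of how a .mlir file can be loaded programmatically through MLIR's C++ API; the file name and the single registered dialect (the builtin func dialect) are assumptions for the example:

// Illustrative sketch; assumes an MLIR development installation.
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/MLIRContext.h"
#include "mlir/Parser/Parser.h"

int main() {
  mlir::MLIRContext context;
  // Register the dialects we expect to see in the input text.
  context.getOrLoadDialect<mlir::func::FuncDialect>();

  // Parse the textual IR into an in-memory module we can inspect or transform.
  mlir::OwningOpRef<mlir::ModuleOp> module =
      mlir::parseSourceFile<mlir::ModuleOp>("input.mlir", &context);
  if (!module)
    return 1; // Parsing failed, e.g. unknown dialect or malformed IR.

  module->dump(); // Print the parsed module back out.
  return 0;
}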

MLIR Primitives

So what does MLIR look like, how does it work and get parsed? The hierarchy of an MLIR Module is as shown:

#any_device = #tt.operand_constraint<dram|l1|scalar|tile|any_device|any_device_tile>
module attributes {tt.system_desc = #tt.system_desc<[<#tt.arch<wormhole_b0>, #tt.grid<8x8>>], [0], [<pcie|host_mmio>], [<0, 0, 0, 0>]>} {
  func.func @forward(%arg0: tensor<64x128xf32>, %arg1: tensor<64x128xf32>) -> tensor<64x128xf32> {
    %0 = tensor.empty() : tensor<64x128xf32>
    %1 = "ttir.multiply"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>, operand_constraints = [#any_device, #any_device, #any_device]}> : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
    return %1 : tensor<64x128xf32>
  }
}
  • Attributes (defined using #)

    • The syntax of actually creating an attribute is modular, and custom assembly instructions for different attributes can be applied.
  • Operations

    • Operations are prefixed with their dialect name followed by a ., so you'll see examples like func.func or tensor.empty. Each operation also provides its own assembly instructions and often strictly defines the type of its result.

    • Quotes are added around ttir.multiply since it's part of a custom dialect, and more custom assembly instructions are applied to specify operand_constraints.

    • Operations typically have operands (arguments) and results, which are prefixed with %; these results and operands help to show the relationships between operations.

  • Types

    • Types are shown as data formats throughout this compiled MLIR module; tensor and array are some examples.

    • They help to demonstrate the transformation of information and its representation as it is processed across this module.

MLIR Workflow

The overall MLIR workflow doesn't involve writing .mlir files by hand, and not necessarily even modifying them. The Intermediate Representations are truly just representations: we can parse them to demonstrate what the graph looks like at that stage of optimization, or run a pass over them to optimize certain functions. The overall framework is designed with the following architecture in mind:

  1. Graph Information exists

  2. Graph information is transformed (by whatever method) into a high-level MLIR representation

  3. Passes are run on the high-level implementation to lower into TTIR, a common IR that can be lowered into multiple backends

  4. Depending on the usecase more passes are run to lower to whatever backend the user would like (ex: TTNN Backend)

What are Passes?

Transformations in MLIR are represented as passes that run over an MLIR module. These passes can be executed when parsing or generating MLIR modules. They can serve a myriad of purposes, and how they modify the module is completely user defined. Some passes are for lowering purposes, as mentioned before: a dialect is parsed and each operation is transformed into a lowered dialect following some set of user-defined rules. In the context of this project, passes are also used for optimizations and backend code transformation. They're a powerful tool that provides most of the functionality for transforming between layers of dialects, and they offer a simple platform for modifying an MLIR module.
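
As a rough illustration, here is a minimal sketch of a pass using the standard MLIR C++ API; the pass name and behaviour are hypothetical and this is not an actual tt-mlir pass. A pass is typically a class whose runOnOperation() walks or rewrites the IR it is anchored on:

// Illustrative sketch; not part of tt-mlir.
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/Pass.h"

namespace {
struct ExamplePass
    : public mlir::PassWrapper<ExamplePass,
                               mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    mlir::ModuleOp module = getOperation();
    // Walk every operation in the module; a real lowering pass would
    // replace ops from one dialect with equivalent ops from another.
    module.walk([](mlir::Operation *op) {
      // Inspect or rewrite `op` here.
    });
  }
};
} // namespace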

Why not make our own?

Now that I've described the functionality of the MLIR framework, it might seem like making an in-house multi-level Intermediate Representation system would be pretty similar, so why are we going through the effort of adopting this framework?

One of the biggest reasons is the active developer community surrounding the project: being a part of the LLVM Project means that there is solid developer support, and the framework is designed to be a tool for many different paradigms of compute. This scalability and strong mission statement make MLIR a solid platform to use as a middle layer in our compiler stack. Furthermore, as a functional benefit of being part of a larger open source project, MLIR has a whole library of tests and infrastructure that we can leverage for solid code health while starting a new project.

Automation

It's not only about developer support; another key benefit of MLIR is that it's built with autogeneration in mind. Through TableGen, a lot of the boilerplate of creating this multi-level IR is abstracted away, letting us truly focus on implementation and execution (see the TableGen op definition in the "Adding an Op" section for an example). This automation is built on top of a pre-existing robust framework with many implementations and support from other large players in the ML scene. By integrating with these automation pipelines, we give external developers a much simpler entry point into our software stack!

TT-MLIR: Bringing MLIR to the TT Stack

Now that we have defined this pretty cool project, let's look at the implementation details of bringing MLIR (and related optimizations) into the TT stack. Since it acts as a mid-level compiler, we can start by defining the "bottom" and "top" layers of the compiler. BUDA already has a well-defined set of frontend optimizations over some TVM-defined graph and is knowledgeable of the hardware that these models want to run on. We want to interrupt the BUDA stack to give us only the frontend-compiled graph, before any hardware-specific lowering occurs. What this produces is information that is agnostic to the different backends and their execution on TT hardware, but is still valid information to optimize at different levels for later compilation. The "bottom" of our graph is now defined as the backend that will produce the machine-specific code to be executed. While MLIR could allow for any level of complexity downwards for the bottom, we will define a very aggressive TTNN backend for the MVP.

Desired Optimization List:

  • Forge-FE (frontend)

    • Graph Optimizations, Constant Folding, Operation Fusion
  • TT-MLIR (mid-level)

    • Data Storage, Memory Configuration, Grid Configuration
  • TT-NN (backend)

    • Kernel Configuration*, Network Optimization

*Subject to Change / Be Moved to TT-MLIR

TT-MLIR Dialects

Now that we have defined the series of optimizations that we would like to see implemented in TT-MLIR, we can begin to define the dialects that support these different levels of optimization. For more detail on each of these dialects, please refer to the GitHub Wiki and TableGen descriptors. I think that Nick does a great job of documenting the key functionality.

TT Dialect

The TT Dialect contains only the common Types and Attributes used throughout the many levels of the mid-level compiler.

TTIR Dialect

The TTIR Dialect is defined as the common dialect for TT-MLIR; as such, it doesn't define anything hardware- or backend-specific. It lists out general actions that would take place on TT hardware, such as dispatch, layout, and kernel operations.

Generic Operation

This is one of two operations that are crucial for understanding the intended optimization characteristics of the TTIR Dialect. The generic operation dictates the actions taken to dispatch some instruction to TT hardware for execution. Parametrically, the operation consumes inputs, outputs, maps for reading the tensors, and access types for the memory. These parameters highlight the optimizations that can be performed at this level: changing the location of the memory, transposing using different access maps, or even changing the grid upon which the computation takes place. The operation also contains a block in which the exact behaviour of that operation is stored.

Layout Operation

The layout operation is key in describing the storage of memory throughout the execution graph. Layout determines the sharding spec, location of the memory, data types, and tile sizes of some tensor. While generic describes the dispatch for some data-wise transformation to take place, the data itself is laid out across the chip through the layout operation.

Both of these operations describe the key functionality of the TTIR dialect and the optimization space that it provides.

Built-in MLIR Dialects

The functionality of the TT-MLIR dialects also depends on / is inspired by the functionality of built-in MLIR dialects like Affine and LinAlg. Below are summaries of some of the key members of these dialects.

Affine Dialect

[Reference] Affine maps help to describe transformations on coordinate systems. If that doesn't immediately make sense, imagine trying to index a rank-2 tensor. By getting t[x, y] I can access the element in the Xth row and Yth column, but if I wanted to transpose the tensor I might have to re-lay out the entire tensor so that the data would be accessible using t[x, y] to get the element in the Yth row and Xth column. This transpose can instead be represented using an affine map that transforms (x, y) -> (y, x), which lets the tensor data remain in place while the access method is modified. This extends even further to more complex transformations, so that stride lengths or unique indexing methods can be implemented without complicated data manipulation.
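
As a small sketch (illustrative only, using the builtin MLIR C++ API rather than anything tt-mlir specific), the transpose access pattern above can be built as a permutation affine map:

// Illustrative sketch; assumes an MLIR development installation.
#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/AffineMap.h"
#include "mlir/IR/MLIRContext.h"

int main() {
  mlir::MLIRContext context;
  // Permutation {1, 0} yields the map (d0, d1) -> (d1, d0), i.e. a transpose
  // of the access pattern without moving any tensor data.
  llvm::SmallVector<unsigned> permutation = {1, 0};
  mlir::AffineMap transpose =
      mlir::AffineMap::getPermutationMap(permutation, &context);
  transpose.dump(); // Prints: (d0, d1) -> (d1, d0)
  return 0;
}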

Tensor Dialect

[Reference] The tensor dialect defines the functionality and Type of the fundamental Tensor. This dialect contains members that represent the manipulation and representation of tensors as multi-dimensional data with shapes and datatypes. Not much else is different about this dialect; the reference covers the key topics if implementation details are needed.

Func Dialect

[Reference]

TOSA Dialect

[Reference]

SCF Dialect

[Reference]

EmitC Dialect

[Reference]

TT-Explorer - Performance Optimization Tool

A unique project related to TT-MLIR is the integration of performance optimization tools, so that users can easily visualize and readily tune their models without needing an expert-level understanding of the tech stack. TT-Explorer is built with Google AI's Model Explorer as the base for the visualization tool, plus a custom adapter to parse TT-MLIR projects. This allows users to readily tune their models and optimize at the TTIR layer (e.g., they can change certain memory to be laid out in L1 instead of DRAM, or change the grid layout of an operation to be larger than what was previously assigned). After compilation with these overrides, the runtime information can then be fed directly into a Tracy performance analysis for the user to visualize the impacts of their tuning, seeing which operations were least performant and continuing in a gamified design loop of iterative performance tuning!

Building

These are the steps required to get the TT-MLIR project running on your machine.

Please refer to the Dependencies section before building the project.

Environment setup

You only need to build this once; it builds LLVM, flatbuffers, and a Python virtual environment.

cmake -B env/build env
cmake --build env/build
  • It is recommended to use the system installation of python3 for the virtual environment. Please ensure that you do not already have a venv activated before running the above command.
  • Please ensure the directory /opt/ttmlir-toolchain exists and its owner is the current user, i.e. the one that executes the above cmake commands. The commands to create it and assign the proper ownership are:
    sudo mkdir -p /opt/ttmlir-toolchain
    sudo chown -R $USER /opt/ttmlir-toolchain
    

Build

source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17
cmake --build build
  • To enable the ttnn/metal runtime add -DTTMLIR_ENABLE_RUNTIME=ON
  • Clang 17 is the minimum required version when enabling the runtime.
  • To enable the ttnn/metal perf runtime add -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
  • To accelerate the builds with ccache use -DCMAKE_CXX_COMPILER_LAUNCHER=ccache
  • To accelerate builds further, if python bindings aren't needed, add -DTTMLIR_ENABLE_BINDINGS_PYTHON=OFF. For some reason the python bindings link step is very slow.
  • TTNN build is automatically integrated / handled by tt-mlir cmake build system. For debugging and further information regarding the TTNN backend build step, please refer to TTNN Documentation.
  • The runtime build step depends on the ARCH_NAME environment variable, which is set in the env/activate script. If you want to build the runtime for a different architecture, please set ARCH_NAME to the desired value (one of grayskull, wormhole_b0, or blackhole). Please note that the runtime is built only if TTMLIR_ENABLE_RUNTIME=ON.
  • In addition to ARCH_NAME, the runtime build depends on the TT_METAL_HOME variable, which is also set in the env/activate script. For more information, please refer to the TT-NN and TT-Metalium installation documentation.
OS | Offline Compiler Only | Runtime Enabled Build | Runtime + Perf Enabled Build
Ubuntu 22.04
Ubuntu 20.04
MacOS

Test

source env/activate
cmake --build build -- check-ttmlir

llvm-lit

Under the hood the check-ttmlir cmake target is running llvm-lit. With it you can:

# Query which tests are available
llvm-lit -sv ./build/test --show-tests

# Run an individual test:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR/test_allocate.mlir

# Run a sub-suite:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR

See the full llvm-lit documentation for more information.

Lint

source env/activate
cmake --build build -- clang-tidy

Pre-Commit

Pre-Commit applies a git hook to the local repository such that linting is checked and applied on every git commit action. Install from the root of the repository using:

source env/activate
pre-commit install

If you have already committed before installing the pre-commit hooks, you can run on all files to "catch up":

pre-commit run --all-files

For more information visit pre-commit

Docs

source env/activate
cmake --build build -- docs
mdbook serve build/docs
  • mdbook can be installed with the system's package manager.
  • mdbook serve will by default create a local server at http://localhost:3000.

Note: If you want to build the docs on MacOS, there are two extra dependencies: doxygen and graphviz.

Both can be installed using Homebrew by running the following commands:

brew install doxygen
brew install graphviz

Dependencies

Ubuntu Common

Make sure to have Git LFS installed. You can install it with the following command:

sudo apt-get install git-lfs

Ubuntu 22.04

On Ubuntu 22.04 we need to install clang and ninja, and to update the version of cmake, because 3.20 is the minimum version required for this project.

sudo apt update
sudo apt upgrade

sudo apt install clang-17
sudo apt install ninja-build

sudo apt remove cmake -y
pip3 install cmake --upgrade
hash -r

Ensure cmake can be found in the path pip installed it to, e.g. PATH=$PATH:$HOME/.local/bin

Then run the following command to see the cmake version, which should be newer than 3.20:

cmake --version


MacOS

On MacOS we need to install the latest versions of cmake and ninja, which can be done using Homebrew (docs for installing Homebrew: https://brew.sh):

brew install cmake
brew install ninja

Common Build Errors

TTMLIRPythonCAPI target requires changing an RPATH

CMake Error at /opt/ttmlir-toolchain/lib/cmake/llvm/AddLLVM.cmake:594 (add_library):
  The install of the TTMLIRPythonCAPI target requires changing an RPATH from
  the build tree, but this is not supported with the Ninja generator unless
  on an ELF-based or XCOFF-based platform.  The
  CMAKE_BUILD_WITH_INSTALL_RPATH variable may be set to avoid this relinking
  step.

If you get the above error, it means you tried to build with an old version of cmake or ninja and there is a stale file. To fix this, rm -rf your build directory, install a newer version of cmake/ninja, and then rebuild. If you installed ninja via sudo apt install ninja-build, it might still not be up to date (v1.10.0). You may use the ninja from the python virtual environment, or install it via pip3 install -U ninja; either way, version 1.11.1.git.kitware.jobserver-1 should work.

clang++ is not a full path and was not found in the PATH

CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_CXX_COMPILER:
    clang++
  is not a full path and was not found in the PATH.
  Tell CMake where to find the compiler by setting either the environment
  variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
  to the compiler, or to the compiler name if it is in the PATH.
CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_C_COMPILER:
    clang
  is not a full path and was not found in the PATH.
  Tell CMake where to find the compiler by setting either the environment
  variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
  the compiler, or to the compiler name if it is in the PATH.

If you get the above error, it means you need to install clang, which you can do with sudo apt install clang on Ubuntu.

sfpi, trisc, ncrisc build failure

tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 1: version: not found
tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 2: oid: not found
size: '1961632': No such file
size: '1961632': No such file
size: '1961632': No such file
Always | FATAL | ncrisc build failed

If you got the above error, it means that the SFPI build (or a similar component build) failed. First, make sure you have Git LFS set up (e.g. sudo apt-get install git-lfs). Then, try to pull the SFPI submodule manually:

cd third_party/tt-metal/src/tt-metal
git submodule foreach 'git lfs fetch --all && git lfs pull'

Then, try to build again.

Common Runtime Errors

Debugging python on macOS

When debugging python on macOS via lldb you may see an error like:

(lldb) r
error: process exited with status -1 (attach failed (Not allowed to attach to process.  Look in the console messages (Console.app), near the debugserver entries, when the attach failed.  The subsystem that denied t
he attach permission will likely have logged an informative message about why it was denied.))

For preinstalled macOS binaries you must manually codesign with debug entitlements.

Create file debuggee-entitlement.xml:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
        <key>com.apple.security.cs.disable-library-validation</key>
        <true/>
        <key>com.apple.security.get-task-allow</key>
        <true/>
</dict>
</plist>

Sign the binary:

sudo codesign -f -s - --entitlements debuggee-entitlement.xml /opt/ttmlir-toolchain/venv/bin/python

Internal Build Notes / IRD

  • When building the runtime we must use Ubuntu 22.04 docker image
    • When making an IRD reservation use --docker-image yyz-gitlab.local.tenstorrent.com:5005/tenstorrent/infra/ird-ubuntu-22-04-amd64:latest
  • You'll have to manually install a newer version of cmake, at least 3.22; the easiest way to do this is to pip install cmake and make sure this one is in your path
  • You'll want the LLVM installation to persist across IRD reservations; you can achieve this by:
    • mkdir /localdev/$USER/ttmlir-toolchain
    • When requesting an IRD use --volumes /localdev/$USER/ttmlir-toolchain:/opt/ttmlir-toolchain

Working with Docker Images

Components:

  • Dockerfile
  • Workflow for building Docker image
  • Project build using Docker image

Overview

We use Docker images to prepare the project environment and install dependencies, tooling, and the prebuilt toolchain. The project builds four Docker images:

  • Base image: tt-mlir-base-ubuntu-22-04 (Dockerfile.base)
  • CI image: tt-mlir-ci-ubuntu-22-04 (Dockerfile.ci)
  • Base IRD image: tt-mlir-base-ird-ubuntu-22-04 (Dockerfile.ird)
  • IRD image: tt-mlir-ird-ubuntu-22-04 (Dockerfile.ird)

The base image starts from a supported base image (Ubuntu 22.04) and installs the dependencies for the project build. From there we build the CI image, which contains the prebuilt toolchain and is used in CI to shorten the build time. The IRD images contain dev tools like GDB, vim, etc. and ssh, and are used in IRD environments.

During the CI Docker build, the project is built and tests are run to ensure that everything is set up correctly. If any dependencies are missing, the Docker build will fail.

Building the Docker Image using GitHub Actions

The GitHub Actions workflow Build and Publish Docker Image builds the Docker images and uploads them to GitHub Packages at https://github.com/orgs/tenstorrent/packages?repo_name=tt-mlir. We use the git SHA we build from as the tag.

Building the Docker Image Locally

To test the changes and build the image locally, use the following command:

docker build -f .github/Dockerfile.base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ci-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-ubuntu-22-04:latest .

Using the Image in GitHub Actions Jobs

The GitHub Actions workflow Build in Docker uses a Docker container for building:

    container:
      image: ghcr.io/${{ github.repository }}/tt-mlir-ci-ubuntu-22-04:latest
      options: --user root

Tools

Currently, there are a few primary tools that are part of the ttmlir project:

  • ttmlir-opt: The ttmlir optimizer driver. This tool is used to run the ttmlir compiler passes on .mlir source files and is central to developing and testing the compiler.
  • ttrt: This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files.
  • tt-explorer: Visualizer tool for ttmlir-powered compiler results. It visualizes emitted .mlir files to display the compiled model, attributes, and performance results, and provides a platform for human-driven overrides to gamify model tuning.

ttmlir-opt

The ttmlir optimizer driver. This tool is used to run the ttmlir compiler passes on .mlir source files and is central to developing and testing the compiler.

Simple Test

./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir
# Or
./build/bin/ttmlir-opt --ttir-to-ttmetal-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir

ttmlir-translate

The ttmlir-translate translation utility. Unlike the ttmlir-opt tool, which is used to run passes within the MLIR world, ttmlir-translate allows us to ingest something (e.g. code) into the MLIR world, and also to produce something (e.g. an executable binary, or even code again) from MLIR.

Generate C++ code from MLIR

# First, let's run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o c.mlir

# Now run `ttmlir-translate` to produce C++ code
./build/bin/ttmlir-translate -mlir-to-cpp c.mlir -allow-unregistered-dialect

Bonus: these two commands can be piped to avoid writing an .mlir file to disk, like so:

./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir | ./build/bin/ttmlir-translate -mlir-to-cpp -allow-unregistered-dialect

Generate flatbuffer file from MLIR

# First run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o ttnn.mlir

# Now run `ttmlir-translate` to produce flatbuffer file
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn

ttrt

This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files. It enables the running of flatbuffer files without a front-end runtime.

Building

source env/activate
cmake --build build -- ttrt
ttrt --help

Installing ttrt as python whls (coming soon)

  1. Download whls
  2. Create a python venv
python -m venv ttrt_env
source ttrt_env/bin/activate
  3. Install whls
pip install *.whl

Building runtime mode

Add the following flags when building the compiler

-DTTMLIR_ENABLE_RUNTIME=ON

If you are building with runtime mode on (-DTTMLIR_ENABLE_RUNTIME=ON), you will have to install the following packages when using ttrt:

pip install torch

Building perf mode

Add the following flags when building the compiler

-DTTMLIR_ENABLE_RUNTIME=ON
-DTT_RUNTIME_ENABLE_PERF_TRACE=ON

If you are building with perf mode on (-DTT_RUNTIME_ENABLE_PERF_TRACE=ON), you will have to install the following packages when using ttrt:

pip install torch
pip install loguru
pip install pandas
pip install seaborn
pip install graphviz
pip install pyyaml
pip install click

Generate a flatbuffer file from compiler

The compiler supports a pass to load a system descriptor to compile against. You can run this pass via ttmlir-opt.

  1. Build ttmlir
  2. Build ttrt (see building section on this page)
  3. Generate ttsys file from the system you want to compile for using ttrt. This will create a system_desc.ttsys file under ttrt-artifacts folder.
ttrt query --save-artifacts
  4. Use the ttmlir-opt tool in the compiler to feed in the system descriptor. See the ttmlir-opt documentation for more information on how to generate .mlir files.
./build/bin/ttmlir-opt --ttir-load-system-desc="path=/path/to/system_desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
or (pass the path directly into ttir-to-ttnn-backend-pipeline)
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=/path/to/system_desc.ttsys" test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
  5. Use the ttmlir-translate tool in the compiler to generate the flatbuffer executable. See the ttmlir-translate documentation for more information on how to generate flatbuffer files.
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
  6. Run your test cases using ttrt
ttrt run /path/to/out.ttnn

Generate flatbuffer files using llvm-lit

There are already existing .mlir test cases under test/ttmlir/Silicon. You can use the llvm-lit tool to generate the corresponding ttnn and ttm files.

  1. Build ttmlir
  2. Build ttrt (see building section on this page)
  3. Generate ttsys file from the system you want to compile for using ttrt. This will create a system_desc.ttsys file under ttrt-artifacts folder.
ttrt query --save-artifacts
  4. Export this file in your environment using export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys. When llvm-lit is run, it will query this variable and generate the ttnn and ttm files using this system. Optionally, you can also provide this manually when running llvm-lit.
  5. Generate your test cases. This will generate all your ttnn and ttm files under build/test/ttmlir/Silicon. ttnn files have a .ttnn file extension and ttmetal files have a .ttm extension.
cmake --build build -- check-ttmlir
  6. (Optional) If you have a single .mlir file (or a directory of custom .mlir files) that you created using the compiler, and you want to generate the corresponding ttnn and ttm files for it, you can run llvm-lit standalone on the path of your .mlir file or directory of .mlir files to generate the flatbuffer executables. You will have to make sure you add the correct llvm-lit configs into your .mlir file; see the section on adding llvm-lit config options inside a .mlir file to create flatbuffer binaries for more info. You must also make sure your .mlir test is found within the test/ttmlir/Silicon folder (and point lit to the build folder)!
llvm-lit -v ./build/test/ttmlir/Silicon
or
SYSTEM_DESC_PATH=/path/to/system_desc.ttsys llvm-lit -v ./build/test/ttmlir/Silicon
  7. Run your test cases using ttrt
ttrt run /path/to/test.ttnn
ttrt run /path/to/dir/of/flatbuffers

Adding llvm-lit config options inside a .mlir file to create flatbuffer binaries

Inside of your .mlir file, you can add certain config options that llvm-lit will use when running against that test case. For the purpose of generating flatbuffer executables, you can add --ttir-load-system-desc="path=%system_desc_path%", which tells llvm-lit to use the system desc found via the environment flag set by export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys. You can also provide a custom path to a system desc file instead.

// RUN: ttmlir-opt --ttir-load-system-desc="path=%system_desc_path%" --ttir-implicit-device --ttir-layout --convert-ttir-to-ttnn %s  > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn

Adding new mlir test cases

You can copy your .mlir test file (with the appropriate llvm-lit config options for generating flatbuffer binaries) into test/ttmlir/Silicon. Then, follow the "Generate flatbuffer files using llvm-lit" section to generate the executables to run!

Versioning

ttrt and flatbuffers have a strict versioning check. When running a flatbuffer against ttrt, you have to make sure the flatbuffer was generated using the same version as ttrt (or vice versa). Major and minor versions are manually set using GitHub tags when releases are made. The patch version is the number of commits since the last major/minor tag.

vmajor.minor.patch

APIs

ttrt --help
ttrt read
ttrt run
ttrt query
ttrt perf
ttrt check

Command Line

There are different ways you can use the APIs under ttrt. The first is via the command line, as follows. All artifacts are saved under the ttrt-artifacts folder under the TT_MLIR_HOME environment variable. By default, all logging is printed to the terminal. You can specify a log file to dump output to.

read

Read sections of a binary file

ttrt read --help
ttrt read --section mlir out.ttnn
ttrt read --section cpp out.ttnn
ttrt read --section version out.ttnn
ttrt read --section system_desc out.ttnn
ttrt read --section inputs out.ttnn
ttrt read --section outputs out.ttnn
ttrt read --section all out.ttnn
ttrt read --section all out.ttnn --clean-artifacts
ttrt read --section all out.ttnn --save-artifacts
ttrt read --section all /dir/of/flatbuffers
ttrt read system_desc.ttsys
ttrt read --section system_desc system_desc.ttsys
ttrt read system_desc.ttsys --log-file ttrt.log

run

Run a binary file or a directory of binary files. Note: this requires a system with silicon and a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).

ttrt run --help
ttrt run out.ttnn
ttrt run out.ttnn --seed 0
ttrt run out.ttnn --init arange
ttrt run out.ttnn --identity
ttrt run out.ttnn --identity --rtol 1 --atol 1
ttrt run out.ttnn --clean-artifacts
ttrt run out.ttnn --save-artifacts
ttrt run out.ttnn --loops 10
ttrt run --program-index all out.ttnn
ttrt run --program-index 0 out.ttnn
ttrt run /dir/of/flatbuffers
ttrt run /dir/of/flatbuffers --loops 10
ttrt run /dir/of/flatbuffers --log-file ttrt.log

query

Query the system to obtain the system desc file (optionally store it to disk). Note: this requires a system with silicon and a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).

ttrt query --help
ttrt query --save-artifacts
ttrt query --clean-artifacts
ttrt query --save-artifacts --log-file ttrt.log

perf

Run performance mode on a binary file or a directory of binary files. Note: this requires a system with silicon and a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON), plus a perf-enabled build (-DTT_RUNTIME_ENABLE_PERF_TRACE=ON). Note: you can collect host-only performance data via the --host-only flag; by default, both host- and device-side performance data are collected. Restriction: /dir/of/flatbuffers can only be used when collecting --host-only data (performance data is collected upon closing the device, so if we run a directory of flatbuffers we cannot get accurate device performance data since the device is only closed at the end of execution). Restriction: perf mode can (for now) only run on .mlir files that have a single function (func.func).

ttrt perf --help
ttrt perf out.ttnn
ttrt perf out.ttnn --clean-artifacts
ttrt perf out.ttnn --save-artifacts
ttrt perf out.ttnn --loops 10
ttrt perf --program-index all out.ttnn
ttrt perf --program-index 0 out.ttnn
ttrt perf --host-only out.ttnn
ttrt perf /dir/of/flatbuffers --host-only
ttrt perf /dir/of/flatbuffers --loops 10 --host-only
ttrt perf /dir/of/flatbuffers --log-file ttrt.log --host-only

check

Check a binary file or a directory of binary files against a system desc (by default, the host machine's). Note: this requires a system with silicon and a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).

ttrt check --help
ttrt check out.ttnn
ttrt check out.ttnn --system-desc /path/to/system_desc.ttsys
ttrt check out.ttnn --clean-artifacts
ttrt check out.ttnn --save-artifacts
ttrt check out.ttnn --log-file ttrt.log
ttrt check /dir/of/flatbuffers --system-desc /dir/of/system_desc

ttrt as a python package

The other way to use the APIs under ttrt is to import it as a library. This allows the user to use it in custom scripts.

Import ttrt as a python package

from ttrt.common.api import API

Setup API and register all features

API.initialize_apis()

Setup arguments

You can specify certain arguments to pass to each API, or use the default arguments provided.

args

This can be a dictionary of values to set inside your API instance. These are the same options as found via the command line. You can get the full list of supported arguments via each API's registered_args dictionary (e.g., API.Query.registered_args). Any argument not provided will be set to the default.

custom_args = API.Query.registered_args
custom_args["clean-artifacts"] = True
query_instance = API.Query(args=custom_args)

custom_args = { "clean-artifacts": True }
query_instance = API.Query(args=custom_args)

logging

You can specify a logging module to set inside your API instance. The rationale behind this is to support different instances of different APIs, each of which can be logged to a different file.

from ttrt.common.util import Logger

log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)
read_instance = API.Read(logging=custom_logger)

artifacts

You can specify an artifacts directory to store all the generated metadata during the execution of any API run. This allows you to specify different artifact directories for different instances of APIs if you wish.

from ttrt.common.util import Artifacts

log_file_name = "some_file_name.log"
artifacts_folder_path = "/opt/folder"
custom_logger = Logger(log_file_name)
custom_artifacts = Artifacts(logging=custom_logger, artifacts_folder_path=artifacts_folder_path)
run_instance = API.Run(artifacts=custom_artifacts)

Execute API

Once all the arguments are set up, you can run your API instance with all your provided arguments. Note: APIs are stateless, so subsequent calls to the same API instance will not preserve previous call artifacts. For example, you can generate a new artifacts directory for subsequent runs if you wish to call the APIs multiple times.

query_instance()
read_instance()
run_instance()

Putting it all together

You can do interesting things by combining all the above features in your python script:

from ttrt.common.api import API
from ttrt.common.util import Logger
from ttrt.common.util import Artifacts

API.initialize_apis()

custom_args = API.Run.registered_args
custom_args["clean-artifacts"] = True
custom_args["save-artifacts"] = True
custom_args["loops"] = 10
custom_args["init"] = "randn"
custom_args["binary"] = "/path/to/subtract.ttnn"

log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)

artifacts_folder_path = "/opt/folder"
custom_artifacts = Artifacts(logging=custom_logger, artifacts_folder_path=artifacts_folder_path)

run_instance = API.Run(args=custom_args, logging=custom_logger, artifacts=custom_artifacts)
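
# Finally, execute the configured instance (see the "Execute API" section above).
run_instance()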

FAQ

Flatbuffer version does not match ttrt version!

  • ttrt and flatbuffers have strict versioning that is checked during ttrt execution. You will have to generate a flatbuffer using the same version of ttrt (or vice versa). This means you might have to build on the same branch on which the flatbuffer was generated, or regenerate the flatbuffer using your current build.

System desc does not match flatbuffer!

  • Flatbuffers are compiled using a specific system desc (or default values if no system desc is provided). During runtime, the flatbuffer's system desc is checked against the current system to ensure the system being run on supports the flatbuffer that was compiled. If you get this error, you will have to regenerate the flatbuffer using the system you want to run on. See the "Generate a flatbuffer file from compiler" section for how to do this.

I just want to test and push my commit! What do I do!

  • Follow these steps (on both n150 and n300):
1. Build ttmlir (sample instructions - subject to change)
source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTTMLIR_ENABLE_RUNTIME=ON -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
cmake --build build

2. Build ttrt (sample instructions - subject to change)
cmake --build build -- ttrt

3. Query system
ttrt query --save-artifacts

4. Export system desc file
export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys (path dumped in previous command)

5. Generate test cases
cmake --build build -- check-ttmlir

6. Run test cases
ttrt run build/test/ttmlir/Silicon

7. (Optional) Run perf test cases
ttrt perf build/test/ttmlir/Silicon

tt-explorer

Visualizer tool for ttmlir-powered compiler results. It visualizes emitted .mlir files to display the compiled model, attributes, and performance results, and provides a platform for human-driven overrides to gamify model tuning.

For more information and installation details please check out the repository README at https://github.com/vprajapati-tt/tt-explorer?tab=readme-ov-file#tt-explorer

Flatbuffers

Flatbuffers are the binary serialization format used by TTMLIR and they currently come in a few flavors (designated by the file extension):

  • .ttsys: A system description file that is the mechanism for supplying target information to the compiler. These can be collected on a target machine and downloaded to a development machine to enable cross-compilation.
  • .ttnn: A compiled binary file intended to be loaded and executed by the TTNN backend runtime.
  • .ttb: A compiled binary file intended to be loaded and executed by the TTMetal backend runtime (Unsupported).

CI

Our CI infrastructure is currently hosted in the cloud. Cloud machines are used and linked as GitHub runners.

Key Words

Target Silicon (coming soon)

- 1:1 mapping to unique system-desc (this is because an n150 card can have different harvested rows)

Target Family

- product type (n150, n300)

Target Capabilities (coming soon)

- describes testable traits of Target Family
n150: {
    test params to use if running on n150
}
n300: {
    test params to use if running on n300
}

Test Capabilities (coming soon)

- set of target capabilities defined in the test
- test will populate certain parameters depending on the Target Family/Target Silicon it is running on

GitHub Runner CI Tags

Runner Use

There are 2 types of runner machines: builders build the project offline, and runners are silicon machines.

- builder
- runner

Runner Type

There are 2 runner types. Bare-metal runners are standalone machines, and virtual machines are Kubernetes pods.

- bare-metal
- virtual-machine

Architecture

Supported architectures

- wormhole_b0
- blackhole (coming soon)

Pipeline Type

Supported pipelines

- perf
- functional

Active

Defines whether a runner is in service or taken out of service for maintenance

- in-service
- out-of-service

Target Family

Supported configurations of machines

- n150
- n300
- t3000 (coming soon)
- tg (coming soon)
- tgg (coming soon)

Target Silicon (coming soon)

- silicon-n150-0 (0th row harvested)
- silicon-n150-1 (1st row harvested)
- silicon-n300-0-0 (0th row harvested on both chips)

Pipeline durations

- push: every push to main
- pr: every PR

CI Test Flow

1. GitHub runner
- build tt-mlir
- build ttrt
- upload artifacts

2. Silicon runner
- download tt-mlir / ttrt artifacts
- ttrt generate system desc
- llvm-lit runs all unit tests, including silicon ones, to generate flatbuffers (it will only generate the ones supported for that test file)
- ttrt runs generated flatbuffers

Adding a test

When adding a test, you can specify when the test should run and what values it should inherit. The test defines how it should run, not the infrastructure; the infrastructure will execute what the test defines. For now, if you specify nothing, it will run on all default parameters. Note: if you provide a target family, it will by default run on any target silicon machine. If you need a specific target silicon machine (e.g. one with the 1st row harvested), specify it in Target Silicon. Note: if you specify the perf pipeline, it will automatically run on a bare metal machine. Default parameters:

[Architecture]: [wormhole_b0]
[Pipeline]: [functional, perf]
[Target Family]: [n150, n300]
[Target Silicon]: []
[Duration]: [push]
Location: test/ttmlir/Silicon
File Type: .mlir
REQUIRES: [Architecture] [Pipeline] [Target Family] [Target Silicon] [Duration] (coming soon)
UNSUPPORTED: [Target Family] [Target Silicon] (coming soon)

Additional Reading

This section contains pointers to reading material that may be useful for understanding the project.

MLIR

  • https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index.html
  • https://mlir.llvm.org/docs/Tutorials/Toy/
  • https://www.jeremykun.com/2023/08/10/mlir-getting-started/
  • https://arxiv.org/pdf/2002.11054
  • https://ieeexplore.ieee.org/abstract/document/9370308

Dialects

Tablegen

LLVM Testing Framework Tools

Jax
Flatbuffer
Openxla Website
openxla
StableHLO

Contributor Covenant Code of Conduct

Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

Our Standards

Examples of behavior that contributes to a positive environment for our community include:

  • Demonstrating empathy and kindness toward other people
  • Being respectful of differing opinions, viewpoints, and experiences
  • Giving and gracefully accepting constructive feedback
  • Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
  • Focusing on what is best not just for us as individuals, but for the overall community

Examples of unacceptable behavior include:

  • The use of sexualized language or imagery, and sexual attention or advances of any kind
  • Trolling, insulting or derogatory comments, and personal or political attacks
  • Public or private harassment
  • Publishing others' private information, such as a physical or email address, without their explicit permission
  • Other conduct which could reasonably be considered inappropriate in a professional setting

Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

Scope

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at nsmith@tenstorrent.com or staylor@tenstorrent.com. All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.

Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:

1. Correction

Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.

Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.

2. Warning

Community Impact: A violation through a single incident or series of actions.

Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.

3. Temporary Ban

Community Impact: A serious violation of community standards, including sustained inappropriate behavior.

Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.

4. Permanent Ban

Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.

Consequence: A permanent ban from any sort of public interaction within the community.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by Mozilla's code of conduct enforcement ladder.

For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.

Project Structure

  • env: Contains the environment setup for building project dependencies, such as LLVM and Flatbuffers
  • include/ttmlir: Public headers for the TTMLIR library
    • Dialect: MLIR dialect interfaces and definitions, dialects typically follow a common directory tree structure:
      • IR: MLIR operation/type/attribute interfaces and definitions
      • Passes.[h|td]: MLIR pass interfaces and definitions
      • Transforms: Common MLIR transformations, typically invoked by passes
    • Target: Flatbuffer schema definitions. This defines the binary interface between the compiler and the runtime
  • lib: TTMLIR library implementation
    • CAPI: C API for interfacing with the TTMLIR library, note this is needed for implementing the python bindings. Read more about it here: https://mlir.llvm.org/docs/Bindings/Python/#use-the-c-api
    • Dialect: MLIR dialect implementations
  • runtime: Device runtime implementation
    • include/tt/runtime: Public headers for the runtime interface
    • lib: Runtime implementation
    • tools/python: Python bindings for the runtime, currently this is where ttrt is implemented
  • test: Test suite
  • tools/ttmlir-opt: TTMLIR optimizer driver

Namespaces

  • mlir: On the compiler side, we use the MLIR namespace for all MLIR types and operations and subnamespace for our dialects.
    • mlir::tt: Everything ttmlir related is underneath this namespace. Since we need to subnamespace under mlir, just mlir::tt seemed better than mlir::ttmlir which feels redundant.
      • mlir::tt::ttir: The TTIR dialect namespace
      • mlir::tt::ttnn: The TTNN dialect namespace
      • mlir::tt::ttmetal: The TTMetal dialect namespace
      • mlir::tt::ttkernel: The TTKernel dialect namespace
  • tt::runtime: On the runtime side, we use the tt::runtime namespace for all runtime types and operations.
    • tt::runtime::ttnn: The TTNN runtime namespace
    • tt::runtime::ttmetal: The TTMetal runtime namespace (not implemented)

Dialects Overview

Here is a brief overview of the dialects in the project; please refer to the individual dialect documentation for more details:

  • tt: Common types such as tt.tile, tt.layout, tt.grid, etc. and enums such as data formats, memory spaces, iterator types, etc.
  • ttir: A high level dialect that models the tensor compute graph on tenstorrent devices. Accepts tosa and linalg input.
    • ttir.generic: Generically describe compute work.
    • ttir.to_layout: Convert between different tensor memory layouts and transfer between different memory spaces.
    • tensor.pad: Pad a tensor with a value (i.e. for convs)
    • ttir.yield: return result memref of computation in dispatch region body, lowers to ttkernel.yield
    • ttir.kernel: lowers to some backend kernel
  • ttnn: A TTNN dialect that models ttnn API.
  • ttkernel: Tenstorrent kernel library operations.
    • ttkernel.noc_async_read
    • ttkernel.noc_async_write
    • ttkernel.cb_push_back
    • ttkernel.[matmul|add|multiply]: Computations on tiles in source register space, store the result in dest register space.
    • ttkernel.sfpu_*: Computations on tiles in dest register space using sfpu coprocessor.
  • ttmetal: Operations that dispatch work from host to device.
    • ttmetal.dispatch: Dispatch a grid of compute work.

Adding an Op

This guide will walk you through the process of adding a new Op end to end in tt-mlir; in this case we will be adding a matmul operation. Note that the matmul op was added as part of the same changeset as this guide, so it could be useful to reference the diff alongside this guide to see the changes in full.

This guide will cover the following steps:

1. Define the Op in the TTIR frontend dialect

We will start by defining the Op in the TTIR dialect. The TTIR Ops are defined in a tablegen file located at include/ttmlir/Dialect/TTIR/IR/TTIROps.td.

Tablegen is a domain-specific language for defining ops/types/attributes in MLIR and LLVM; these definitions constitute the dialect's Operation Definition Specification (ODS).

Here is an example of defining matmul in the TTIR dialect:

def TTIR_MatmulOp : TTIR_DPSOp<"matmul"> {
    let summary = "Matrix multiply operation.";
    let description = [{
      Matrix multiply operation.
    }];

    let arguments = (ins AnyRankedTensor:$a,
                         AnyRankedTensor:$b,
                         AnyRankedTensor:$output,
                         TT_OperandConstraintArrayAttr:$operand_constraints);

    let results = (outs AnyRankedTensor:$result);

    let extraClassDeclaration = [{
      MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
    }];

    let hasVerifier = 1;
}

There are many things to break down here, starting from the top:

  • def in tablegen is used to define a concrete type; this will have a 1-1 mapping to a C++ generated class, and for this particular case the build will end up generating the file build/include/ttmlir/Dialect/TTIR/IR/TTIROps.h.inc.
  • It inherits from class TTIR_DPSOp; classes in tablegen don't define a concrete type, but rather an interface that augments or constrains inherited defs. TTIR_DPSOp is a class that defines the common attributes for all TTIR Ops that implement Destination Passing Style (DPS) semantics. DPS just means that the result tensor is passed as an argument to the operation, which will be critical for modeling buffer allocation / lifetimes. Note the 3rd argument AnyRankedTensor:$output.
  • Next we have a list of arguments. These arguments consist of a mixture of Types (i.e. AnyRankedTensor) and Attributes (i.e. TT_OperandConstraintArrayAttr). Read more about Types & Attributes here.
    • AnyRankedTensor is part of a tablegen standand library which type aliases to MLIR's builtin Tensor type, with the added constraint that the tensor has a static rank. As much as possible we want to use the builtin types and infrastructure provided by MLIR.
    • TT_OperandConstraintArrayAttr is a custom attribute that we have defined in the TT dialect. This attribute is used to specify constraints on the operands of the operation. For example, the TTIR_MatmulOp requires that the input tensors be in tile layout, this attribute captures this constraint.
  • Next we have a list of results in this case just 1, which aliases the output tensor. One drawback of DPS is that the result tensor and the output tensor will appear to have different SSA names in the IR, but they really alias the same object. This can make writing some passes more cumbersome.
  • Next we have extraClassDeclaration, which enables us to inject member functions, written directly in C++, into the generated class. We are doing this for this particular case in order to satisfy the DPS interface which requires an implementation for getting the mutatated output tensor.
  • Finally, we have hasVerifier = 1, this tells MLIR that we have a verifier function that will be called to validate the operation. This is a good practice to ensure that the IR is well formed.

We can now try building and opening the TTIROps.h.inc file to see the generated C++ code. We will actually get a linker error because hasVerifier = 1 automatically declared a verifier function, but we still need to implement it.

Let's head over to lib/Dialect/TTIR/IR/TTIROps.cpp and implement the verifier.

::mlir::LogicalResult mlir::tt::ttir::MatmulOp::verify() {
  ::mlir::RankedTensorType inputAType = getA().getType();
  ::mlir::RankedTensorType inputBType = getB().getType();
  ::mlir::RankedTensorType outputType = getOutput().getType();
  auto inputAShape = inputAType.getShape();
  auto inputBShape = inputBType.getShape();
  auto outputShape = outputType.getShape();
  if (inputAShape.size() < 2) {
    return emitOpError("Input A must be at least a 2D tensor");
  }
  if (inputBShape.size() < 2) {
    return emitOpError("Input B must be at least a 2D tensor");
  }
  if (inputAShape.size() != inputBShape.size()) {
    return emitOpError("Input A and B must have the same rank");
  }
  if (inputAShape.size() != outputShape.size()) {
    return emitOpError("Input A and B must have the same rank as the output");
  }
  if (inputAShape[inputAShape.size() - 1] !=
      inputBShape[inputBShape.size() - 2]) {
    return emitOpError("Input A and B must have matching inner dimensions");
  }
  if (outputShape[outputShape.size() - 2] !=
      inputAShape[inputAShape.size() - 2]) {
    return emitOpError("Output must have the same number of rows as input A");
  }
  if (outputShape[outputShape.size() - 1] !=
      inputBShape[inputBShape.size() - 1]) {
    return emitOpError(
        "Output must have the same number of columns as input B");
  }
  return success();
}

2. Define the Op in the TTNN backend dialect

Next we will define the Op in the TTNN dialect. TTNN Ops are defined in the same way, but in their respective set of dialect files. Refer to the previous section for details, the process is the same.

TTNNOps.td

def TTNN_MatmulOp : TTNN_NamedDPSOp<"matmul"> {
    let arguments = (ins AnyRankedTensor:$a,
                         AnyRankedTensor:$b,
                         AnyRankedTensor:$output);
    let results = (outs AnyRankedTensor:$result);

    let extraClassDeclaration = [{
      MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
    }];

    let hasVerifier = 1;
}

TTNNOps.cpp

::mlir::LogicalResult mlir::tt::ttnn::MatmulOp::verify() {
  ::mlir::RankedTensorType inputAType = getA().getType();
  ::mlir::RankedTensorType inputBType = getB().getType();
  ::mlir::RankedTensorType outputType = getOutput().getType();
  auto inputAShape = inputAType.getShape();
  auto inputBShape = inputBType.getShape();
  auto outputShape = outputType.getShape();
  if (inputAShape.size() < 2) {
    return emitOpError("Input A must be at least a 2D tensor");
  }
  if (inputBShape.size() < 2) {
    return emitOpError("Input B must be at least a 2D tensor");
  }
  if (inputAShape.size() != inputBShape.size()) {
    return emitOpError("Input A and B must have the same rank");
  }
  if (inputAShape.size() != outputShape.size()) {
    return emitOpError("Input A and B must have the same rank as the output");
  }
  if (inputAShape[inputAShape.size() - 1] !=
      inputBShape[inputBShape.size() - 2]) {
    return emitOpError("Input A and B must have matching inner dimensions");
  }
  if (outputShape[outputShape.size() - 2] !=
      inputAShape[inputAShape.size() - 2]) {
    return emitOpError("Output must have the same number of rows as input A");
  }
  if (outputShape[outputShape.size() - 1] !=
      inputBShape[inputBShape.size() - 1]) {
    return emitOpError(
        "Output must have the same number of columns as input B");
  }
  return success();
}

3. Convert / Implement the Op in the TTNN passes

Next we will implement the conversion from the TTIR matmul Op to the TTNN matmul Op. This is a trivial conversion, as the Ops are identical in their semantics, so the changeset isn't going to be very instructive, but will at least point to the files involved. The conversion is implemented in the ConvertTTIRToTTNNPass pass in file lib/Conversion/TTIRToTTNN/TTIRToTTNNPass.cpp.

Zooming into class ConvertTTIRToTTNNPass we can see that we implement the pass interface via the member function void runOnOperation() final. This function will be called for every operation matching the type specified in the pass tablegen file. Taking a quick look at include/ttmlir/Conversion/Passes.td we can see:

def ConvertTTIRToTTNN: Pass<"convert-ttir-to-ttnn", "::mlir::ModuleOp"> {

This means that runOnOperation will be called for every ModuleOp in the graph; usually there is only one ModuleOp, which serves as the root of the graph.

Inside runOnOperation is usually where we define a rewrite pattern set that can match much more complicated patterns (nested inside of the ModuleOp's regions) than just a single operation. In the runOnOperation method you will see the call to populateTTIRToTTNNPatterns(...), which actually generates the rewrite patterns. populateTTIRToTTNNPatterns(...) is defined in lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp.

  patterns
      .add<TensorEmptyConversionPattern,
           ToLayoutOpConversionPattern,
           ElementwiseOpConversionPattern<ttir::AbsOp, ttnn::AbsOp>,
           ElementwiseOpConversionPattern<ttir::AddOp, ttnn::AddOp>,
           ElementwiseOpConversionPattern<ttir::SubtractOp, ttnn::SubtractOp>,
           ElementwiseOpConversionPattern<ttir::MultiplyOp, ttnn::MultiplyOp>,
           ElementwiseOpConversionPattern<ttir::GreaterEqualOp, ttnn::GreaterEqualOp>,
           ElementwiseOpConversionPattern<ttir::MaximumOp, ttnn::MaximumOp>,
           ElementwiseOpConversionPattern<ttir::NegOp, ttnn::NegOp>,
           ElementwiseOpConversionPattern<ttir::ReluOp, ttnn::ReluOp>,
           ElementwiseOpConversionPattern<ttir::SqrtOp, ttnn::SqrtOp>,
           ElementwiseOpConversionPattern<ttir::SigmoidOp, ttnn::SigmoidOp>,
           ElementwiseOpConversionPattern<ttir::ReciprocalOp, ttnn::ReciprocalOp>,
           ElementwiseOpConversionPattern<ttir::ExpOp, ttnn::ExpOp>,
           ElementwiseOpConversionPattern<ttir::DivOp, ttnn::DivOp>,
           ReductionOpConversionPattern<ttir::SumOp, ttnn::SumOp>,
           ReductionOpConversionPattern<ttir::MeanOp, ttnn::MeanOp>,
           ReductionOpConversionPattern<ttir::MaxOp, ttnn::MaxOp>,
           BroadcastOpConversionPattern,
           EmbeddingOpConversionPattern,
           SoftmaxOpConversionPattern,
           TransposeOpConversionPattern,
           ConcatOpConversionPattern,
           ReshapeOpConversionPattern,
           SqueezeOpConversionPattern,
           UnsqueezeOpConversionPattern,
           ConstantOpConversionPattern,
           MatmulOpConversionPattern,
           Conv2dOpConversionPattern,
           MaxPool2dOpConversionPattern
           >(typeConverter, ctx);

More information on rewrite patterns and their capabilities can be found in the MLIR documentation here and here.

For matmul, we defined a new conversion pattern that's generic to all binary ops with arguments named a and b:

class MatmulOpConversionPattern : public OpConversionPattern<ttir::MatmulOp> {
public:
  using OpConversionPattern<ttir::MatmulOp>::OpConversionPattern;

  LogicalResult
  matchAndRewrite(ttir::MatmulOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    rewriter.replaceOpWithNewOp<ttnn::MatmulOp>(
        op, this->getTypeConverter()->convertType(op.getType()), adaptor.getA(),
        adaptor.getB(), adaptor.getOutput());
    return success();
  }
};

Invoked as part of the rewrite set:

MatmulOpConversionPattern

Note:

We also need to add this op to the C++ emitter, lib/Conversion/TTNNToEmitC/TTNNToEmitC.cpp; see populateTTNNToEmitCPatterns(...).
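For reference, a minimal sketch of what that registration might look like. The DefaultOpConversionPattern helper and the exact signature of populateTTNNToEmitCPatterns are assumptions for illustration; consult TTNNToEmitC.cpp for the real pattern class and arguments.

// Hypothetical excerpt from lib/Conversion/TTNNToEmitC/TTNNToEmitC.cpp.
// DefaultOpConversionPattern and the function signature are illustrative only.
void populateTTNNToEmitCPatterns(mlir::MLIRContext *ctx,
                                 mlir::RewritePatternSet &patterns,
                                 mlir::TypeConverter &typeConverter) {
  // ... existing op patterns ...
  patterns.add<DefaultOpConversionPattern<mlir::tt::ttnn::MatmulOp>>(
      typeConverter, ctx);
}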

4. Add a unit test for the Op

So far we have defined the Op in the TTIR and TTNN dialects, implemented verifiers, and have conversion passes. Now we need to add a unit test to ensure that the pass is working correctly. The unit tests are located in test/ttmlir/Dialect area. In this case we'll add a test under the TTNN subdirectory since we are testing the ConvertTTIRToTTNNPass.

test/ttmlir/Dialect/TTNN/simple_matmul.mlir

// RUN: ttmlir-opt --ttir-load-system-desc --ttir-implicit-device --ttir-layout --convert-ttir-to-ttnn %s | FileCheck %s
#any_device_tile = #tt.operand_constraint<dram|l1|tile|any_device_tile>
// CHECK: #[[TILED_LAYOUT:.*]] = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, interleaved>
module attributes {} {
  func.func @forward(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
    %0 = tensor.empty() : tensor<64x96xbf16>
    // CHECK: %[[C:.*]] = "ttnn.matmul"[[C:.*]]
    %1 = "ttir.matmul"(%arg0, %arg1, %0) <{operand_constraints = [#any_device_tile, #any_device_tile, #any_device_tile]}> : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
    return %1 : tensor<64x96xbf16>
  }
}

Unit tests in MLIR are typically written using a tool called FileCheck, please refer to the llvm FileCheck documentation for a tutorial and more information about the RUN and CHECK directives.

A few things to point out specifically regarding tt-mlir dialects:

  • tt.system_desc: This is a 1-1 mapping to the SystemDesc flatbuffer schema that is used to describe the system configuration. This is a required attribute tagged on the top level module for all tt-mlir dialects.
  • Pass --ttir-layout is a prerequisite before running convert-ttir-to-ttnn. This pass is responsible for converting the input tensors to device memory space and tile layout before lowering to TTNN.
  • This test is asserting that ttir.matmul converts to ttnn.matmul.

To run the test, you can use the following command:

cmake --build build -- check-ttmlir

You can also manually run ttmlir-opt on the test file to see the resulting output:

./build/bin/ttmlir-opt --ttir-layout --convert-ttir-to-ttnn test/ttmlir/Dialect/TTNN/simple_matmul.mlir

5. Define flatbuffer schema for the Op

Next we will define the flatbuffer schema for the Op. The schema must capture all tensor inputs, outputs, and attributes of the Op, i.e. everything the runtime needs to execute the Op.

include/ttmlir/Target/TTNN/program.fbs

table MatmulOp {
  in0: tt.target.TensorRef;
  in1: tt.target.TensorRef;
  out: tt.target.TensorRef;
}

Note the type TensorRef: flatbuffer tables with the suffix Ref are used to represent live values during the runtime, decoupled from the underlying Desc-suffixed tables, which carry the type and attribute information for the object.

We also add this new op to the union OpType, which is the variant type for all ops.

More information about writing flatbuffer schemas can be found in the flatbuffers documentation.

6. Serialize the Op in the flatbuffer format

In the previous section we defined the flatbuffer schema for the matmul Op, now let's put our new schema definition to use. The schema is used as input to a program called flatc which generates C++ code (or any language for that matter) for serializing and deserializing the schema. This generated code can be found in build/include/ttmlir/Target/TTNN/program_generated.h.

Let's head over to lib/Target/TTNN/TTNNToFlatbuffer.cpp to define a createOp overloaded function that does the conversion from MLIR to flatbuffer:

::flatbuffers::Offset<::tt::target::ttnn::MatmulOp>
createOp(FlatbufferObjectCache &cache, MatmulOp op) {
  auto in0 =
      cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getA()));
  auto in1 =
      cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getB()));
  auto output = cache.at<::tt::target::TensorRef>(
      getOperandThroughDPSOps(op.getResult()));
  return ::tt::target::ttnn::CreateMatmulOp(*cache.fbb, in0, in1, output);
}

Lots of things are happening here, let's break it down:

  • FlatbufferObjectCache: This is a helper class that is used to cache objects in the flatbuffer that are created during the serialization process. This is necessary for managing value lifetimes and identifiers, at the same time it is an optimization to avoid having multiple copies of the same object. For example, a TensorRef with multiple uses could naively be recreated, one for each use, but with the cache we can ensure that the object is only created once and all uses point to the same flatbuffer offset. The cache is passed around to all serialization functions and should be used whenever creating a new object.
  • getOperandThroughDPSOps: In section 1. we discussed DPS semantics and the drawback of having the result alias the output tensor. This is one of those cases where we need to use a helper function to trace through the output operands to find the original SSA name in order to associate it with the original TensorRef.
  • CreateMatmulOp: The autogenerated function from the flatbuffer schema that actually serializes the data into the flatbuffer format.

We can finally generate a binary with our new Op! We can use the following command:

./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_matmul.mlir | ./build/bin/ttmlir-translate --ttnn-to-flatbuffer -o out.ttnn

And we can inspect the binary with ttrt:

ttrt read out.ttnn

7. Add runtime support for the Op

The final step is to add runtime support for the Op by parsing the flatbuffer and invoking the TTNN API.

runtime/lib/ttnn/program.cpp
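A minimal sketch of the runtime handler for MatmulOp, assuming a liveTensors map keyed by global_id (as described in the notes below); the tensorPool and the exact ::ttnn::matmul entry point are assumptions for illustration.

// Minimal sketch of a runtime handler for MatmulOp. The liveTensors map and
// the flatbuffer accessors (op->in0()->global_id(), etc.) mirror the notes
// below; the tensorPool and the ::ttnn::matmul call are illustrative only.
#include <cstdint>
#include <list>
#include <unordered_map>

static void run(::tt::target::ttnn::MatmulOp const *op,
                std::unordered_map<std::uint32_t, ::ttnn::Tensor *> &liveTensors,
                std::list<::ttnn::Tensor> &tensorPool) {
  // Inputs are looked up by the global_id assigned by the FlatbufferObjectCache
  // during serialization.
  ::ttnn::Tensor &lhs = *liveTensors.at(op->in0()->global_id());
  ::ttnn::Tensor &rhs = *liveTensors.at(op->in1()->global_id());
  // Invoke the TTNN API (illustrative call) and register the result so that
  // downstream ops can find it by the output's global_id.
  tensorPool.push_back(::ttnn::matmul(lhs, rhs));
  liveTensors.insert_or_assign(op->out()->global_id(), &tensorPool.back());
}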


A couple things to note from above:

  • Most runtime op functions will follow a similar pattern; they will take in some additional data structures for managing live tensors.
  • liveTensors.at(op->in0()->global_id()): global_id is a unique identifier for the tensor that was generated and managed by the FlatbufferObjectCache. This is how it's intended to be used by the runtime.

We can test our changes with ttrt (don't forget to rebuild ttrt):

ttrt run out.ttnn

Doxygen

This is a link to a doxygen autogenerated code reference. Doxygen

Build Instructions

To build the Doxygen documentation, use the doxygen target in CMake:

cmake -B build
cmake --build build -- doxygen

Specifications

Specifications are documents that define the requirements for features or concepts that are particularly cross-cutting, complex, or require a high degree of coordination and planning. They are intended to be living documents that evolve as the feature is developed and should be maintained as the go-to reference documentation for the feature or concept.

Specifications are written in markdown and are stored in the docs/src/specs directory of the repository. Below is a template that should be used when creating a new specification.

Specification Template

# [Title]

A brief description of the feature or concept that this specification is
defining.

## Motivation

A description of why this feature or concept is needed and what problem it is
solving. This section is best written by providing concrete examples and use
cases.

## Proposed Changes

A list of the components that will be impacted by this spec and a detailed
description of the changes that will be made to each respective component.

It should also call out any interactions between components and how they might
share an interface or communicate with each other.

## Test Plan

A brief description of how the feature or concept will be tested.

## Concerns

A list of concerns that have been identified during the design of this feature.

Runtime Stitching

Runtime stitching adds the ability for the runtime to stitch together multiple, independently compiled programs at runtime, i.e. without compiler knowledge of how the binary programs will be composed.

Motivation

In order to flexibly support arbitrary training schedules / composing multiple models together we want to have the ability for the runtime to stitch graphs together. To achieve this we need to define an ABI kind of interface between the compiler and the runtime.

Simple Example

mod_a = forge.compile(PyTorch_module_a)
mod_b = forge.compile(PyTorch_module_b)

for i in range(10):
    outs_a = mod_a(ins_a)
    outs_b = mod_b(outs_a)

mod_a and mod_b are 2 independent compile steps. During the compile step for mod_a it should be completely unaware that mod_b will take place, and vice-versa. In order to achieve this we propose a new runtime concept called stitching:

  • forge invokes the compile step for mod_a; the tt-mlir compiler determines where the inputs (ins_a) should live: host, device DRAM, or device L1. tt-mlir returns metadata to forge describing where it wants the tensors to reside before invoking flatbuffer submission.
  • forge invokes the compile step for mod_b; the same happens as in bullet 1.
  • mod_a is invoked at runtime; the forge runtime needs to inspect the compiler metadata to determine where the tensors should live. The runtime manually invokes a new data copy command to get the tensors to the correct memory space / correct memory address.
  • forge runtime invokes the mod_a program submit.
  • mod_b is invoked at runtime; this time it might be that the compiler left the tensor outputs in L1, so no data copy is needed to start running mod_b since the inputs are already in the correct location.

A more concrete usecase would be a training loop where there are often multiple graphs composed together (#82), or when we eventually support torch 2.0, where the torch runtime can arbitrarily break the graph anywhere.

Proposed Changes

Compiler Metadata

The compiler will encode the input tensor layout information directly into the flatbuffer tensor desc. The flatbuffer schema already exists to express this; we just need to start populating it instead of assuming a canonical host layout.

The compiler will decide where the tensors should live: host, device DRAM, or device L1.

Runtime

  • Runtime will inspect the tensor desc metadata to determine where the tensors need to end up / what layout they should be in before invoking the program.
  • New runtime API Tensor toLayout(Tensor tensor, ::tt::target::TensorDesc* tensorDesc);
  • Runtime will need to invoke toLayout on all input tensors before invoking the program (a sketch of this flow follows this list).
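A minimal sketch of how a host runtime might use the proposed API when stitching two programs. Only the toLayout signature comes from the proposal above; Binary, getInputDesc, and submit are placeholder names, not real runtime calls.

// Illustrative host-side stitching flow. Only toLayout is from the proposed
// API above; Binary, getInputDesc, and submit are hypothetical helpers.
std::vector<Tensor> runStitched(Binary &modA, Binary &modB,
                                std::vector<Tensor> insA) {
  // Move mod_a's inputs to wherever the compiler metadata says they must live.
  for (size_t i = 0; i < insA.size(); ++i)
    insA[i] = toLayout(insA[i], modA.getInputDesc(i));
  std::vector<Tensor> outsA = submit(modA, insA);

  // mod_b may have been compiled to expect its inputs where mod_a left its
  // outputs (e.g. L1), in which case toLayout degenerates to a no-op.
  for (size_t i = 0; i < outsA.size(); ++i)
    outsA[i] = toLayout(outsA[i], modB.getInputDesc(i));
  return submit(modB, outsA);
}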

Test Plan

  • Add a new test to the runtime gtest suite that verifies the runtime can correctly stitch together 2 independently compiled programs.

Concerns

  • Tensors that pass through device memory spaces (DRAM, L1) will have a dynamic address; some arbitrary run order of flatbuffers could cause tensors to end up in non-ideal locations in memory. Specifically in L1, a poorly placed tensor might not be able to be moved to a better location without a bounce through DRAM.

Tensor Layout

The tensor layout attribute captures how tensor data is sharded across a grid of devices, cores, and is laid out in memory.

Motivation / High level goals

  • Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
  • Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
  • Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
  • Explicit: A single source of truth.
  • Enable a direct way to query padded regions.

An Example / Walkthrough

Let's consider a snippet of MLIR:

tensor<2x3x64x128xf32>

Here we've defined a 4 dimensional tensor using MLIR's builtin tensor type. This tensor type has an optional attribute called an Encoding, this attribute has been used by the TT dialect to encode the tensor's layout. This looks like:

tensor<2x3x64x128xf32,
  #tt.layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <1x1>,
    memref<384x128xf32, #tt.memory_space<l1>>
  >
>

At the time of this writing there are 4 properties that make up a tensor layout:

  • linear: An affine map that defines how the logical tensor dimensions map to a grid shape. Note that the number of dims in the affine map must match exactly the rank of the original tensor, and the number of results must match exactly the rank of the grid shape.
  • oob_val: A tracked out of bounds value that fills padding space.
  • grid: The grid shape that this tensor is divided onto.
  • memref: A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid.

This example isn't particularly complicated because it's only sharded to a 1x1 grid. The rest of the document will go into more detail on the following topics:

  • Dimension Collapsing
  • Multi-core
  • Tilized
  • Padding
  • Memory Spaces
  • Multi-device

Before we jump into more advanced topics there are two resources that could be useful to have at hand:

  • test/python/tensor_layout.py: Python test with many convenience functions for creating and experimenting with tensor layouts.
  • TTNN Interactive Visualizer: An interactive visualization tool that demonstrates the transformation. Note that this tool was created for TTNN tensor layout, but many of the same concepts transfer over.

Dimension Collapsing

Probably the most important concept in tt.layout is dimension collapsing. This is captured by the affine map linear property which provides a mapping from tensor dim space to a reduced physical dimensional space. This single-handedly touches on most of the tensor layout goals mentioned at the beginning of the doc:

  • Leaves tensor shapes intact
  • Logical-Physical mapping, how the tensor is laid out in memory over a grid
  • Enables more flexible sharding
  • Explicit padding

To see how these goals are achieved we'll continue working on an explicit example, same one as above:

(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3)

To recap, we have our example 4d tensor (2, 3, 64, 128), which maps directly to the LHS (d0, d1, d2, d3). We have our 2d grid shape (1, 1); notice the affine-map RHS is also 2d, and this describes how tensor dims map to a lower dimensional physical memory, overlaid on a grid. We'll see how this gets divided onto the grid later, but first let's look at how this forms an affine-map iteration space. If we index our tensor at say [1, 1, 6, 100], we can simply plug in those numbers to get our remapped offset:

(1 * 192 + 1 * 64 + 6, 100) = (262, 100)

This remapped offset (262, 100) corresponds to the row and column index of the collapsed physical memory.
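The same index arithmetic, written out as a small standalone sketch; the remap function simply mirrors the affine map above.

// Mirrors (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3) for the
// 2x3x64x128 example tensor.
#include <array>
#include <cstdint>
#include <iostream>

std::array<int64_t, 2> remap(std::array<int64_t, 4> idx) {
  return {idx[0] * 192 + idx[1] * 64 + idx[2], idx[3]};
}

int main() {
  auto offset = remap({1, 1, 6, 100});
  std::cout << offset[0] << ", " << offset[1] << "\n"; // prints 262, 100
}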

By default, the dim range [0, -1) is collapsed, but the tt.layout constructor can actually take a programmable range called collapseIntervals. collapseIntervals is a list of pairs, where each pair is a dim range interval, left inclusive, right exclusive. Let's consider a few examples:

Instead of multiplying out real shapes, we will use <> to represent a dimension join operator.

  • 3D tensor onto a 2D grid and default collapseIntervals=[(0, -1)]:
(d0, d1, d2) -> (d0 <> d1, d2)
  • 4D tensor onto a 3D grid and collapseIntervals=[(1, -1)]:
(d0, d1, d2, d3) -> (d0, d1 <> d2, d3)
  • 4D tensor onto a 3D grid and collapseIntervals=[(0, 2)]:
(d0, d1, d2, d3) -> (d0 <> d1, d2, d3)
  • 7D tensor onto a 4D grid and collapseIntervals=[(0, 3), (-3, -1)]:
(d0, d1, d2, d3, d4, d5, d6) -> (d0 <> d1 <> d2, d3, d4 <> d5, d6)

Multi-core

Let's consider the original example again, but on a larger grid than 1x1, say 2x4:

tensor<2x3x64x128xf32,
  #tt.layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <2x4>,
    memref<192x32xf32, #tt.memory_space<l1>>
  >
>

The number of affine map results, grid shape, and memref shape all must have the same rank. We can see in this example that by changing the grid shape we also changed the memref shape; we can always calculate the memref shape by plugging the full tensor dims into our affine map and then dividing by the grid shape.

(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
(2 - 1, 3 - 1, 64 - 1, 128 - 1) = (1 * 192 + 2 * 64 + 63, 127) = (383, 127)

Above we actually subtracted 1 in order to get the index of the last element of the tensor. Now we can simply add back 1 to get the size:

(383 + 1, 127 + 1) = (384, 128)

Finally, we divide the dims by the respective grid dims:

(384 / 2, 128 / 4) = (192, 32)
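The same derivation as a small sketch: plug the last index into the affine map, add one to get the collapsed physical shape, then ceil-divide by the grid shape (ceil division matters once shapes stop dividing evenly, as in the padding section below).

// Derive the per-core memref (shard) shape for the 2x3x64x128 tensor on a
// 2x4 grid, following the steps above.
#include <array>
#include <cstdint>
#include <iostream>

constexpr int64_t ceilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

int main() {
  // Last element index through the affine map, plus one, is the physical shape.
  std::array<int64_t, 2> physical = {1 * 192 + 2 * 64 + 63 + 1, 127 + 1}; // {384, 128}
  std::array<int64_t, 2> grid = {2, 4};
  std::array<int64_t, 2> shard = {ceilDiv(physical[0], grid[0]),
                                  ceilDiv(physical[1], grid[1])};
  std::cout << shard[0] << "x" << shard[1] << "\n"; // prints 192x32
}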

Here are a few more example MLIR snippets:

tensor<8x300xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>

tensor<5x3x2x2x7x32x32xf32,
  #tt.layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>

A couple of final notes regarding grid shape:

  • Grid shapes of rank > 2 are perfectly legal. Not only is this useful for describing multi-device grid topologies, but it is often convenient to have higher ranked grids to better describe how a high rank tensor should be divided. The grid shape here is a virtual grid shape; the tt.device attribute will hold an additional affine map that defines how this virtual grid shape maps to a physical one.
  • Grid shapes where either the number of columns or rows is greater than the physical device grid are also legal. Since this is only a virtual grid shape we could have some grid 1x64 that maps to a physical 8x8 device grid (this particular example is called width sharding in TTNN).

Tilized

A tilized tensor is one with a memref that has a tile element type.

Given some tensor with scalar layout:

tensor<3x64x128xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<64x64xf32, #tt.memory_space<l1>>
  >
>

After tilizing we'll have:

tensor<3x64x128xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Notice the memref dims were ceilDiv'd by the tile shape and the element type becomes a tt.tile type. Also notice that the tensor shape and element type remain intact.

Padding

Padding can be a bit of an overloaded term, but in this context it refers to an out of bounds area in the physical memory allocation that has no real tensor data in it. The contents of this area are tracked by oob_val, and the padding area can be automatically derived from the attributes of tt.layout.

Padding is a necessary evil that arises when a tensor is not evenly divisible by a grid shape or tile shape. It can also arise due to minimum Noc addressing requirements.

Example of non-divisible grid:

tensor<53x63xf32,
  #tt.layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<18x32xf32, #tt.memory_space<l1>>
  >
>

The grid dims always ceilDiv the affine map results, real tensor data will entirely fill initial shards and the last shard in each dimension will be partially filled.

In this particular example, we have 1 scalar row of padding on the last row of cores and 1 scalar column of padding on the last column of cores.

Taking the above example a step further, we could tilize it:

tensor<53x63xf32,
  #tt.layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<1x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Tile dims also always ceilDiv the resulting memref shape. Notice now that the padding is slightly more complicated. Our scalar shard shape was 18x32, but this was further padded to 32x32 meaning that every core now has 14 rows of padding except for the last row of cores which has 15 rows of padding.
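A short sketch of the padding arithmetic for this 53x63 example, deriving the per-core padding rows quoted above.

// Padding rows for tensor<53x63> on a 3x2 grid with 32x32 tiles.
#include <cstdint>
#include <iostream>

constexpr int64_t ceilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

int main() {
  int64_t rows = 53, gridRows = 3, tileRows = 32;
  int64_t shardRows = ceilDiv(rows, gridRows);                  // 18 scalar rows per core
  int64_t paddedRows = ceilDiv(shardRows, tileRows) * tileRows; // padded up to 32
  int64_t lastShardRows = rows - shardRows * (gridRows - 1);    // 17 real rows on the last core row
  std::cout << paddedRows - shardRows << "\n";     // 14 rows of padding per core
  std::cout << paddedRows - lastShardRows << "\n"; // 15 rows on the last row of cores
}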

Also note that there is an order of operations here, grid divides the scalar shape first and then we tilize. This is important because it can enable use cases that frequently arise in conv networks that would otherwise result in reshards in between every layer.

With affine map we can be even more flexible in how we pad, we can bump our stride between dimensions. Consider tensor (w/ batch dim 2):

tensor<2x8x32xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 8 + d1, d2),
    undef,
    <1x2>,
    memref<16x16xf32, #tt.memory_space<l1>>
  >
>

If we tilized the above tensor we'd end up with a memref shape of 1x1x!tt.tile<32x32>, that is, all batches are tightly packed within a single tile. Let's say that for some reason, we do not want the batches (2) to be tightly packed within a tile, perhaps the mathematical operation we're doing requires the batch to be independently evaluated and thus the (S)FPU needs them in separate tiles. We can adjust this by adjusting the stride of the affine map:

(d0, d1, d2) -> (d0 * 32 + d1, d2),

Instead of striding by the number of logical rows, 8, we bump the stride up to 32 effectively pushing a gap between the collapsed rows and enabling each batch to fall on a tile boundary.
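A quick way to see the effect of the stride bump is to evaluate both maps at the start of the second batch (d0 = 1, d1 = 0): the original stride lands it mid-tile, the bumped stride lands it on a tile boundary.

// Compare the physical row of the first element of batch 1 under both maps.
#include <cstdint>
#include <iostream>

int main() {
  int64_t d0 = 1, d1 = 0;
  std::cout << d0 * 8 + d1 << "\n";  // 8: mid-tile with the original stride
  std::cout << d0 * 32 + d1 << "\n"; // 32: tile-aligned with the bumped stride
}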

Memory Spaces

At the time of writing this document there are 4 memory spaces:

  1. System: Host memory space that is not device visible.
  2. SystemMMIO: Host memory space that is device visible.
  3. DeviceDRAM: DRAM local to the device.
  4. DeviceL1: SRAM on each core.

Something worth noting here is that a tensor must belong exclusively to only one of these memory spaces at a time. For example, in order to stream tensor data from DeviceDRAM to DeviceL1 you would need to either manually slice the tensor into smaller tensors that do fit in L1 or have native support in the op's kernel for double buffering a block (most TTNN ops already support this).

Multi-device

Multi-device can be naturally represented via a combination of two concepts already touched on above, higher ranked grids and collapseIntervals. Let's consider the following example with a 3d grid and collapseIntervals=[(1, -1)].

tensor<2x3x64x128xf32,
  #tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
    undef,
    <2x2x4>,
    memref<1x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Here we've left the batch dim intact and started collapsing at d1. This enables us to define a 3d grid where the outermost grid dim divides the batch directly. This could map to a 2 device system where the batch dim is evenly divided between 2 devices. Within each device this op runs on a 2x4 grid.

The high level takeaway here is that how a tensor is logically divided up is decoupled from its mapping to physical compute resources. This has a nice property that data parallel extends to any tensor dimension and is captured under the same grid primitive that also divides tensor rows and columns.

Test Plan

  • test/python/tensor_layout.py: Assertions for LayoutAttr to make sure it's spec compliant.
  • Sweep tests:
    • Grid dim sweeps
    • Tilize / untilize sweeps
    • Padding sweeps
  • Multi-device tests

Concerns

  • tt.layout is deliberately flexible and tries to capture as many of the problematic use-cases we've run into in the past as possible in a single, succinct representation. This flexibility will need to be further constrained by backends to avoid unsupported programming of this attribute.
  • Optimization solution space is potentially large with all of this flexibility. Two things that I hope can help protect us here:
    • By and large the heuristic we'll be following is just max the grid at all costs. This should really narrow down the solution space to only a handful of options and we only keep exploring if producers/consumers end up with nasty reblocking.
    • We can constrain the optimizer heuristics as aggressively as possible in the beginning and just advertise the full flexible options to the UI model explorer. Hopefully this enables us to experiment with crazier grid layouts and prove it's worthwhile before writing an algorithm.







TTNN Tensor Layout

The above section of this document covers how the compiler models tensor layout. There are some slight differences in TTNN, but the high level idea of collapsing dims is still used.

Terms

  • shape: Always logical shape, n-dimensional
  • stride: Same as pytorch stride, but this is crucial for describing how n-dimensional data gets packed into a 2D physical layout. This 2D physical layout is always the inner dim (-1) wide and dims [0, N-1] are collapsed into rows derived from stride (see the sketch after this list)
  • shard_shape: Also a logical shape, describes a 2d region that chunks physical_shape. Note this does not need to be a tile multiple
  • physical_shard_shape: The shard_shape padded out to tile_shape
  • tile_shape: A programmable tile shape, though constraints must check that it's compatible with an op's usage, i.e. FPU/Noc compatible
  • grid_shape: [divup(stride[0] // stride[-2], shard_shape[0]), divup(stride[-2], shard_shape[0])]
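A minimal sketch of the row-major stride computation the stride term above refers to (standard pytorch-style strides), shown for the running 2x3x64x128 shape.

// Row-major strides: strides[i] is the product of all dims to the right of i.
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<int64_t> rowMajorStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  for (int i = static_cast<int>(shape.size()) - 2; i >= 0; --i)
    strides[i] = strides[i + 1] * shape[i + 1];
  return strides;
}

int main() {
  for (int64_t s : rowMajorStrides({2, 3, 64, 128}))
    std::cout << s << " "; // prints 24576 8192 128 1
}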

Mapping from the compiler

The compiler uses an affine map to explicitly track which dimensions are folded together, but TTNN does not have affine maps so the representation is a bit more implicit. TTNN captures the dimension collapsing in the stride attribute where dimensions [0, N-1] are always collapsed. This is less flexible so the compiler will have to enforce only collapsing supported dimensions when targeting TTNN, or handle lowering in a different way. For example, in the compiler we might want to represent data parallel over the tensor batch dim by leaving d0 and collapsing d1 - d[-1]. TTNN doesn't support this in its tensor layout representation, but this could be lowered to a TTNN mesh tensor where the mesh could be sliced on the batch and each per-device tensor has d0 fully collapsed.

TTNN Example

(Figure: example TTNN tensor layout transformation.)

TTNN Interactive Visualizer

Device

Device in tt-mlir is somewhat of an overloaded term and can refer to different things depending on the context. This document will only speak to the compiler's abstract representation of a device captured by attribute #tt.device.

Terms

There are many overloaded terms when talking about devices and grids, this document will use the following definitions:

  • Physical Grid: A 2D array of tensix cores on a chip.
  • Chip: A single physical chip with a Physical Grid of cores.
  • Card: A PCIE or Ethernet card that may contain multiple Chips.
  • System: A collection of Cards that are usually connected together on the same host via PCIE or networked via ethernet. A system is represented by SystemDesc in the compiler.
  • Device: Device is always presented as a single entity to the enclosing scope, but it may be virtualized to abstract a multi-card System and part of its encoding carries a Logical Grid. Another way to think of device is a view over the system.
  • Logical Grid or just Grid: Is a logical shape that abstracts one or more Physical Grids.
  • Mesh Shape: Describes the virtual layout of the chips with respect to each other. In practice the mesh shape is used to derive the logical grid.

Motivation

The device attribute strives to achieve the following goals:

  • Provide a convenient representation of a physical grid that decouples the logical division of tensors from the physical layout of the hardware. This not only simplifies reasoning about how tensors get divided into shards, but can also enable reinterpretations of the device grid for data layout optimization decoupled from the existing encoding of the tensor layouts.
  • Following the first point, the device attribute should be able to represent many different forms of logical grids, from simple 2D grids, to more complex topologies like extra-wide grids or higher dimensional grids.
  • The device attribute captures both single chip and multi-chip systems under a single, virtualized representation.
  • Enable many forms of data parallel execution strategies for single and multi chip systems under a single representation.

Scope

This document will cover how the device attribute is encoded and how it can be lowered to backend dialects. The document will not cover the algorithm for choosing the best, or even legal, device configurations for a given physical system.

Examples

All of the following examples will assume the physical hardware has an 8x8 physical grid of cores. We will use notation [N, 8x8] to represent a N chip system, each with an 8x8 physical grid.

#tt.device in its simplest, single chip form [1, 8x8] just maps directly 1-1 to the underlying physical hardware device.

#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
  meshShape = 1,
  chipIds = [0]
>

Let's break down what each of these attributes mean:

  • workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>: This is a 2D logical grid with dim 8x8. It's followed by an affine map (d0, d1) -> (0, d0, d1) that provides a mapping from the logical grid to the physical grid. In this case, the logical grid is the same as the physical grid, so the mapping is the identity function. The logical grid can have any rank, but the physical mapping is always 3D, with the first being the chip index, followed by the 2D physical core index within the chip.
  • meshShape = 1: A shape provided as part of the DeviceAttr constructor that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
  • chipIds = [0]: This is a list of chip indices. These chip indices directly reference the same chip indices in the system descriptor. The SystemDesc attribute that this is in reference to is tagged on the top level ModuleOp.

Specific examples that this document will cover:

  • Data Parallel Over Batch
  • Data Parallel Over 2d
  • Data Parallel Over 2d and Batch
  • Pipeline Parallel
  • Reinterpreted Grids

Before we move on to more complex examples, it's worth having on hand:

  • The python test test/python/device_attr.py which shows how all of these examples can actually be programmed for the device attribute.
  • The Tensor Layout spec as the following examples will demonstrate how tensor layout interacts with the logical device grid.

Note on Data Parallel: There is existing literature that explicitly distinguishes between data parallel and tensor parallel, oftentimes describing data parallel as duplicating the model across multiple devices and trivially dividing up the batch whereas tensor parallel refers to tensor data being distributed and potentially communicated between devices during execution. While this is true for multi-GPU/CPU systems, it is somewhat of an implementation detail and given the flexibility of tenstorrent hardware there is an opportunity to generalize this concept. In this document we will use the term data parallel to refer to any form of parallelism that divides any dimension of the tensor across multiple cores/chips.

Note on Constraints: Many of the examples below require careful virtualization of the underlying physical system, i.e. some device configurations might only work if the chips are connected via ethernet and with a particular topology, but these constraints are outside the scope of the examples and will be discussed further in the Backend Lowering and Constraints section.

Data Parallel Over Batch

Given a 2 chip system, [2, 8x8], we can represent a simple data parallel logical grid that divides the batch dimension in half across the two chips. This is denoted by meshShape = 2x1x1 which means the logical grid is 3D.

#tt.device<
  workerGrid = #tt.grid<2x8x8, (d0, d1, d2) -> (d0, d1, d2)>,
  meshShape = 2x1x1,
  chipIds = [0, 1]
>

The affine map here is just identity, so dims d1 and d2 directly index the physical grid and d0 indexes the chip.

Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:

tensor<16x3x64x128xf32,
  #tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
    undef,
    <2x2x4>,
    memref<8x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

If we map this tensor onto the above device, it will span across both chips, half of the batch dimension on each chip. Within each chip the tensor occupies a 2x4 grid out of the 8x8 physical grid available.

Data Parallel Over 2d

In this example we will consider a 2 chip system, [2, 8x8], and view it as though the two chips are concatenated together side by side to form a single 8x16 grid. This is denoted by meshShape = 1x2 which means to concatenate the chips in the second dimension.

#tt.device<
  workerGrid = #tt.grid<8x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8)>,
  meshShape = 1x2,
  chipIds = [0, 1]
>

Here we can see that the affine map encodes an indexing pattern such that when we extend past 8 cores in the second dimension, we wrap around to the next chip.
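To make the wrap-around concrete, here is a small sketch that evaluates this map for one logical core: logical core (3, 12) lands on chip 1 at physical core (3, 4).

// Evaluate (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8).
#include <array>
#include <cstdint>
#include <iostream>

std::array<int64_t, 3> toPhysical(int64_t d0, int64_t d1) {
  return {(d0 / 8) * 2 + d1 / 8, d0, d1 % 8};
}

int main() {
  auto [chip, y, x] = toPhysical(3, 12);
  std::cout << chip << " " << y << " " << x << "\n"; // prints 1 3 4
}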

Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:

tensor<256x1024xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <4x16>,
    memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

This single tensor maps trivially onto the logical grid, spanning the upper half. Decoupled from the tensor's layout, under the hood the tensor is actually physically spanning across two chips.

Data Parallel Over 2d and Batch

The previous 2 examples can be composed together to form a logical grid that divides tensor across multiple dimensions. Here we will consider a 4 chip system [4, 8x8] and view it as a 2x8x16 grid. Note that the meshShape is 2x1x2 which means to concatenate the chips in the first and third dimensions.

#tt.device<
  workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
  meshShape = 2x1x2,
  chipIds = [0, 1, 2, 3]
>

We can evaluate the affine map to see that the chips are interpreted in chunks of two, where groups [0, 1] and [2, 3] each form 8x16 grids and these 2 groups concatenate to form a 2x8x16 grid.

We can consider the following tensor to map onto this grid:

tensor<64x256x1024xf32,
  #tt.layout<(d0, d1, d2) -> (d0, d1, d2),
    undef,
    <2x4x16>,
    memref<32x2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>

Pipeline Parallel

Pipeline parallel in the scope of this spec isn't particularly interesting; it is intended to be used in conjunction with the ttir.pipeline operation, which will group sections of the module's operations into groups to form pipeline regions and will be covered in a separate spec.

What we can demonstrate here is how we can take multiple non-overlapping views of the system descriptor to form distinct virtual devices.

Given an 8 chip system [8, 8x8], we can form two virtual devices that each take 4 chips and interpret them differently (though they could take the same logical grid).

#tt.device<
  workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
  meshShape = 2x1x2,
  chipIds = [0, 1, 2, 3]
>
#tt.device<
  workerGrid = #tt.grid<16x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0 mod 8, d1 mod 8)>,
  meshShape = 2x2,
  chipIds = [4, 5, 6, 7]
>

Reinterpreted Grids (Transpose)

One particularly interesting usecase that logical grids could enable is to reinterpret the grid as a form of data layout optimization. For example, if we wanted to transpose a tensor, instead of having to move the data around to implement transpose, we could instead reinterpret the grid as being transposed, leveraging the fact that the relevant data is already located on the correct cores/chips.

To keep things simple, let's consider a 1 chip system [1, 8x8], but it's not too big a leap to see how this could map to multi-chip where the cost of moving data is even higher.

Let's also consider a simple (totally contrived) eltwise unary graph:

a = exp(a)
aT = transpose(a)
relu(aT)
  1. We'll establish a regular, single chip, identity logical grid:
#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
  meshShape = 1,
  chipIds = [0]
>
  2. Execute exp.
  3. We'll reinterpret the grid as transposed:
#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d1, d0)>,
  meshShape = 1,
  chipIds = [0]
>
  4. Execute transpose. Note that each core only needs to transpose their data locally. Eventually this could be implemented as a no-op by reindexing the tile visitation order of the successive operation.
  5. Execute relu.

It's important to note that we effectively implemented transpose without moving data anywhere.

Reinterpreted Grids (Extra)

For the sake of examples, here's a few more ways of reinterpreting the logical grid.

Extra Wide Grid

#tt.device<
  workerGrid = #tt.grid<1x64, (d0, d1) -> (0, d0 * 8 + d1 floordiv 8, d1 mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

Extra Tall + Transposed Grid

#tt.device<
  workerGrid = #tt.grid<64x1, (d0, d1) -> (0, d1 * 8 + d0 floordiv 8, d0 mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

Staircase

#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, (d0 + d1) mod 8)>,
  meshShape = 1,
  chipIds = [0]
>

This could be an interesting starting position for data in implementing matmul as a systolic array in a ring topology.

Lowering to TTNN

While the above device attribute encoding is quite flexible, this does not necessarily mean the target backend can actually support all of these interpretations. TTNN backend will be constrained to support only the specialized grid topologies that are supported by the API.

Grid/Shard Orientation

TODO

Multi-device

Please refer to TTNN Mesh Programming Docs for more information on how to program multi-device systems with TTNN API.

Multi-device TTNN dialect will try and stay as close to the TTNN API as possible. Let's consider what this looks like from the compiler and runtime perspectives:

Compiler

  • Device Creation: The TTNN device in the compiler is exactly the same attribute from the ttir dialect. It will encode the meshShape into the flatbuffer which can be directly used to program ::ttnn::MeshShape.
  • Tensor Layout: Again, the tensor layout is inherited in the TTNN dialect from the ttir dialect. The grid attribute in the tensor layout can be trivially divided by meshShape to determine the shape of the tensor slice on each device (see the sketch after this list). Broadcasting rules can be applied to determine which Distribution Strategy to use:
    • Mesh Sharded: If the tensor grid is > 1 along the meshShape dimensions, the tensor will be sharded across the mesh devices.
    • Replication: If the tensor needs to be broadcasted for this op, by extension the tensor layout will be replicated across the mesh devices.
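A minimal sketch of the grid division mentioned above, using the 2x2x4 tensor grid and 2x1x1 mesh from the batch-parallel example earlier in this document; it is only an illustration of the arithmetic, not an actual compiler API.

// Per-device tensor grid = tensor grid / meshShape (elementwise).
#include <array>
#include <cstdint>
#include <iostream>

int main() {
  std::array<int64_t, 3> tensorGrid = {2, 2, 4};
  std::array<int64_t, 3> meshShape = {2, 1, 1};
  for (size_t i = 0; i < tensorGrid.size(); ++i)
    std::cout << tensorGrid[i] / meshShape[i] << " "; // prints 1 2 4
}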

Runtime

  • Device Creation: The ttnn runtime will wholesale switch to working with mesh devices via the api ttnn::multi_device::open_mesh_device; this is possible because a 1x1 mesh device is a valid single device. The mesh shape during device open will always be 1xN where N is the number of deviceIds in the array. Note that this shape can be reinterpreted by flatbuffer programs on the fly with the SubMesh API.
  • Tensor Creation: Tensor creation in a multi-device system is a bit more involved. In order to upload a multi-device tensor to the mesh, the host tensor must first be created with MultiDeviceHostStorage. The ttnn runtime can automatically do this during handleToHostMemoryConfigOp:
    • Regular host tensor will bounce through new tensor with MultiDeviceHostStorage type.
    • tensor.to(mesh_device) will allocate/move the tensor to the mesh device.

Lowering to TTMetal

In TTMetal dialect we are only constrained by what we've implemented in the tt-mlir compiler, this means it is much more flexible and can theoretically support any of the grid interpretations above.

Test Plan

  • test/python/device_attr.py covers all of the examples above and asserts the IR is correctly generated.
  • Additional functional unit tests will be added as op and runtime support is added.

Concerns

  • tt.device is very flexible, but with this flexibility comes the potential for misuse. It's important that the compiler is able to validate the legal configurations of this attribute for the target backend.

'tt' Dialect

This dialect defines types and attributes common to all TT dialects.


ArchAttr

TT Arch

Syntax:

#tt.arch<
  ::mlir::tt::Arch   # value
>

Enum cases:

  • grayskull (Grayskull)
  • wormhole_b0 (WormholeB0)
  • blackhole (Blackhole)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::Arch | an enum of type Arch

ArgumentAllocationAttr

Argument allocation attribute in TT dialect

Syntax:

#tt.arg_alloc<
  uint64_t,   # address
  uint64_t,   # size
  MemorySpace   # memorySpace
>

Holds the metadata for the allocation of a function argument, i.e. for graph inputs.

Parameters:

Parameter | C++ type | Description
address | uint64_t
size | uint64_t
memorySpace | MemorySpace

BufferAccessAttr

TT Buffer Access

Syntax:

#tt.buffer_access<
  ::mlir::tt::BufferAccess   # value
>

Enum cases:

  • alias (Alias)
  • stream (Stream)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::BufferAccess | an enum of type BufferAccess

BufferAttr

Buffer attribute in TT dialect

Syntax:

#tt.buffer<
  MemRefType,   # memref
  BufferAccess   # buffer_access
>

Describes the physical footprint and layout of a buffer in L1. Its memref must also have a shape with rank equal to DeviceAttr grid. It also carries a buffer access attribute which can be one of:

  • Alias: This buffer aliases a persistent Tensor L1 allocation directly. Implies that no datamovement occurs and the compute kernel just accesses the local allocation directly.
  • Stream: This buffer is a temporary destination as a means to get remote data for local computation. Remote data is most likely a tensor that is allocated in DRAM, but could also be data from a remote core.

Parameters:

Parameter | C++ type | Description
memref | MemRefType | A memref that describes the physical footprint and layout of the buffer. It must also have a shape with rank equal to DeviceAttr grid.
buffer_access | BufferAccess | How data is accessed through this buffer, alias or stream.

ChipChannelAttr

TT chip_channel attribute

Syntax:

#tt.chip_channel<
  unsigned,   # deviceId0
  ::llvm::ArrayRef<int64_t>,   # ethernetCoreCoord0
  unsigned,   # deviceId1
  ::llvm::ArrayRef<int64_t>   # ethernetCoreCoord1
>

TT chip_channel attribute

Parameters:

Parameter | C++ type | Description
deviceId0 | unsigned
ethernetCoreCoord0 | ::llvm::ArrayRef<int64_t>
deviceId1 | unsigned
ethernetCoreCoord1 | ::llvm::ArrayRef<int64_t>

ChipCoordAttr

TT chip_coord attribute

Syntax:

#tt.chip_coord<
  unsigned,   # rack
  unsigned,   # shelf
  unsigned,   # y
  unsigned   # x
>

TT chip_coord attribute

Parameters:

Parameter | C++ type | Description
rack | unsigned
shelf | unsigned
y | unsigned
x | unsigned

ChipDescAttr

TT chip_desc attribute

Syntax:

#tt.chip_desc<
  ArchAttr,   # arch
  ::llvm::ArrayRef<int64_t>,   # grid
  unsigned,   # l1Size
  unsigned,   # numDramChannels
  unsigned,   # dramChannelSize
  unsigned,   # nocL1AddressAlignBytes
  unsigned,   # pcieAddressAlignBytes
  unsigned,   # nocDRAMAddressAlignBytes
  unsigned,   # l1UnreservedBase
  unsigned,   # eriscL1UnreservedBase
  unsigned,   # dramUnreservedBase
  unsigned,   # dramUnreservedEnd
  ChipPhysicalCoresAttr,   # chipPhysicalCores
  ::llvm::ArrayRef<DataTypeAttr>,   # supportedDataTypes
  ::llvm::ArrayRef<TileSizeAttr>   # supportedTileSizes
>

TT chip_desc attribute

Parameters:

Parameter | C++ type | Description
arch | ArchAttr
grid | ::llvm::ArrayRef<int64_t>
l1Size | unsigned
numDramChannels | unsigned
dramChannelSize | unsigned
nocL1AddressAlignBytes | unsigned
pcieAddressAlignBytes | unsigned
nocDRAMAddressAlignBytes | unsigned
l1UnreservedBase | unsigned
eriscL1UnreservedBase | unsigned
dramUnreservedBase | unsigned
dramUnreservedEnd | unsigned
chipPhysicalCores | ChipPhysicalCoresAttr
supportedDataTypes | ::llvm::ArrayRef<DataTypeAttr>
supportedTileSizes | ::llvm::ArrayRef<TileSizeAttr>

ChipPhysicalCoresAttr

TT chip_physical_cores attribute

Syntax:

#tt.chip_physical_cores<
  ::llvm::ArrayRef<CoreCoordAttr>,   # worker
  ::llvm::ArrayRef<CoreCoordAttr>,   # dram
  ::llvm::ArrayRef<CoreCoordAttr>,   # eth
  ::llvm::ArrayRef<CoreCoordAttr>   # eth_inactive
>

TT chip_physical_cores attribute containing arrays of physical cores by core type in order of logical cores.

Parameters:

Parameter | C++ type | Description
worker | ::llvm::ArrayRef<CoreCoordAttr>
dram | ::llvm::ArrayRef<CoreCoordAttr>
eth | ::llvm::ArrayRef<CoreCoordAttr>
eth_inactive | ::llvm::ArrayRef<CoreCoordAttr>

CoreCoordAttr

TT core_coord attribute

Syntax:

#tt.core_coord<
  int64_t,   # y
  int64_t   # x
>

TT core_coord attribute containing a single physical core coordinate.

Parameters:

Parameter | C++ type | Description
y | int64_t
x | int64_t

DataTypeAttr

TT DataTypes

Syntax:

#tt.supportedDataTypes<
  ::mlir::tt::DataType   # value
>

Enum cases:

  • f32 (Float32)
  • f16 (Float16)
  • bf16 (BFloat16)
  • bfp_f8 (BFP_Float8)
  • bfp_bf8 (BFP_BFloat8)
  • bfp_f4 (BFP_Float4)
  • bfp_bf4 (BFP_BFloat4)
  • bfp_f2 (BFP_Float2)
  • bfp_bf2 (BFP_BFloat2)
  • u32 (UInt32)
  • u16 (UInt16)
  • u8 (UInt8)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::DataType | an enum of type DataType

DeviceAttr

Device attribute in TT dialect.

Syntax:

#tt.device<
  ::mlir::tt::GridAttr,   # workerGrid
  AffineMap,   # l1Map
  AffineMap,   # dramMap
  ::llvm::ArrayRef<int64_t>,   # meshShape
  ::llvm::ArrayRef<unsigned>   # chipIds
>

Describes the physical layout of a device in the system and is made up of a few components:

  • A grid attribute that describes the device's compute grid shape. It not only describes the shape of the compute grid, but also carries an affine map that describes how the logical grid maps to the physical grid.
  • Two affine maps that describe how a tensor layout's linear attribute maps to the L1 and DRAM memory spaces.
  • A mesh shape that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
  • An array of chip ids that this device is made up of. This array's length must match the volume of the mesh shape and should be interpreted in row-major order.

Parameters:

Parameter | C++ type | Description
workerGrid | ::mlir::tt::GridAttr | TT grid attribute
l1Map | AffineMap
dramMap | AffineMap
meshShape | ::llvm::ArrayRef<int64_t>
chipIds | ::llvm::ArrayRef<unsigned>

GridAttr

TT grid attribute

Syntax:

#tt.grid<
  ::llvm::ArrayRef<int64_t>,   # shape
  AffineMap   # mapping
>

TT grid attribute

Parameters:

Parameter | C++ type | Description
shape | ::llvm::ArrayRef<int64_t>
mapping | AffineMap

IteratorTypeAttr

TT IteratorType

Syntax:

#tt.iterator_type<
  ::mlir::tt::IteratorType   # value
>

Enum cases:

  • parallel (Parallel)
  • systolic (Systolic)
  • broadcast (Broadcast)

Parameters:

Parameter | C++ type | Description
value | ::mlir::tt::IteratorType | an enum of type IteratorType

LayoutAttr

Tensor layout attribute

Syntax:

#tt.layout<
  AffineMap,   # linear
  OOBVal,   # oob_val
  GridAttr,   # grid
  MemRefType,   # memref
  TensorMemoryLayout   # mem_layout
>

The tensor layout attribute captures how tensor data is sharded across a grid of devices and cores, and how it is laid out in memory.

Some high level goals

  • Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
  • Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
  • Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
  • Explicit: A single source of truth.
  • Enable a direct way to query padded regions.

Please refer to the Tensor Layout Spec for more in depth documentation.

Examples:

tensor<8x300xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>

tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>

tensor<5x3x2x2x7x32x32xf32,
  #tt.layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| linear | AffineMap | An affine map that defines how the logical tensor dimensions map to a grid shape. |
| oob_val | OOBVal | A tracked out of bounds value that fills padding space. |
| grid | GridAttr | The grid shape that this tensor is divided onto. |
| memref | MemRefType | A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid. |
| mem_layout | TensorMemoryLayout | The layout of the tensor in memory. |

MemorySpaceAttr

TT MemorySpace

Syntax:

#tt.memory_space<
  ::mlir::tt::MemorySpace   # value
>

Enum cases:

  • system (System)
  • mmio (SystemMMIO)
  • dram (DeviceDRAM)
  • l1 (DeviceL1)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::MemorySpace | an enum of type MemorySpace |

OOBValAttr

TT OOBVal

Syntax:

#tt.oob_val<
  ::mlir::tt::OOBVal   # value
>

Enum cases:

  • undef (Undef)
  • zero (Zero)
  • one (One)
  • inf (Inf)
  • neginf (NegInf)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::OOBVal | an enum of type OOBVal |

OperandConstraintAttr

TT Operand Constraints

Syntax:

#tt.operand_constraint<
  ::mlir::tt::OperandConstraint   # value
>

Enum cases:

  • system (System)
  • dram (DRAM)
  • l1 (L1)
  • scalar (Scalar)
  • tile (Tile)
  • none (None)
  • interleaved (Interleaved)
  • single_bank (SingleBank)
  • height_sharded (HeightSharded)
  • width_sharded (WidthSharded)
  • block_sharded (BlockSharded)
  • any_layout (AnyLayout)
  • any (Any)
  • any_device (AnyDevice)
  • any_device_tile (AnyDeviceTile)
  • l1_block_sharded (L1BlockSharded)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::OperandConstraint | an enum of type OperandConstraint |

SystemDescAttr

TT system_desc attribute

Syntax:

#tt.system_desc<
  ::llvm::ArrayRef<ChipDescAttr>,   # chipDescs
  ::llvm::ArrayRef<unsigned>,   # chipDescIndices
  ::llvm::ArrayRef<ChipCapabilityAttr>,   # chipCapabilities
  ::llvm::ArrayRef<ChipCoordAttr>,   # chipCoords
  ::llvm::ArrayRef<ChipChannelAttr>   # chipChannels
>

TT system_desc attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| chipDescs | ::llvm::ArrayRef<ChipDescAttr> | |
| chipDescIndices | ::llvm::ArrayRef<unsigned> | |
| chipCapabilities | ::llvm::ArrayRef<ChipCapabilityAttr> | |
| chipCoords | ::llvm::ArrayRef<ChipCoordAttr> | |
| chipChannels | ::llvm::ArrayRef<ChipChannelAttr> | |

TensorMemoryLayoutAttr

TT TensorMemoryLayout

Syntax:

#tt.tensor_memory_layout<
  ::mlir::tt::TensorMemoryLayout   # value
>

Enum cases:

  • none (None)
  • interleaved (Interleaved)
  • single_bank (SingleBank)
  • height_sharded (HeightSharded)
  • width_sharded (WidthSharded)
  • block_sharded (BlockSharded)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::TensorMemoryLayout | an enum of type TensorMemoryLayout |

TileSizeAttr

TT tile_size attribute

Syntax:

#tt.tile_size<
  int64_t,   # y
  int64_t   # x
>

TT tile_size attribute containing a supported Tensix tile shape.

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| y | int64_t | |
| x | int64_t | |

DeviceType

TT device

Syntax:

!tt.device<
  ::mlir::tt::DeviceAttr   # desc
>

Device type in TT dialect

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| desc | ::mlir::tt::DeviceAttr | Device attribute in TT dialect. |

TileType

TT tile

Syntax:

!tt.tile<
  ::llvm::ArrayRef<int64_t>,   # shape
  DataType   # dataType
>

Tile type in TT dialect

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| shape | ::llvm::ArrayRef<int64_t> | |
| dataType | DataType | |
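
For illustration, a tile type typically appears as the element type of a shard memref, mirroring the memref examples shown for LayoutAttr above. A minimal sketch, assuming a 32x32 bf16 tile stored in L1 (the surrounding memref shape is arbitrary):

memref<2x4x!tt.tile<32x32, bf16>, #tt.memory_space<l1>>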

'ttir' Dialect

The TTIR dialect provides high-level semantics for dispatching work to TT hardware. It defines a set of declarative, high-level operations that are used to describe the dispatch, but is largely agnostic to the set of operations or dialects that are actually supported by a consuming backend.

[TOC]
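
As a quick orientation before the per-op reference, the sketch below shows roughly what a hand-written TTIR-level function looks like prior to any lowering. It is illustrative only: the #any operand-constraint alias, the use of tensor.empty for the destination-style output, and the exact attribute spellings are assumptions rather than compiler output.

#any = #tt.operand_constraint<any>

func.func @add(%arg0: tensor<64x128xf32>, %arg1: tensor<64x128xf32>) -> tensor<64x128xf32> {
  %0 = tensor.empty() : tensor<64x128xf32>
  %1 = "ttir.add"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>,
                                       operand_constraints = [#any, #any, #any]}>
      : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
  return %1 : tensor<64x128xf32>
}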

ttir.abs (tt::ttir::AbsOp)

Eltwise absolute op.

Eltwise absolute operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.add (tt::ttir::AddOp)

Eltwise add.

Eltwise add operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface, TTIR_GenericRegionOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.alloc (tt::ttir::AllocOp)

Alloc op.

Tensor Alloc operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace (enum cases: system, mmio, dram, l1) |

Results:

ResultDescription
resultranked tensor of any type values

ttir.broadcast (tt::ttir::BroadcastOp)

Broadcast operation.

Broadcast op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dimension::mlir::ArrayAttr64-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.concat (tt::ttir::ConcatOp)

Concat op.

Concat tensors along a given dimension.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.constant (tt::ttir::ConstantOp)

Constant op.

Produces tensor filled with given constant value.

Examples:

  %0 = "ttir.constant"() {value = dense<0> : tensor<2x3xi32>} : () -> tensor<2x3xi32>
  // %0: [[0, 0, 0], [0, 0, 0]]
  %1 = "ttir.constant"() {value = dense<[0.2, 1.3]> : tensor<2xf32>} : () -> tensor<2xf32>
  // %1: [0.2, 1.3]

Traits: ConstantLike

Attributes:

AttributeMLIR TypeDescription
value::mlir::ElementsAttrconstant vector/tensor attribute

Results:

ResultDescription
resultranked tensor of any type values

ttir.conv2d (tt::ttir::Conv2dOp)

Conv2d operation.

Applies a 2D convolution over an input image composed of several input planes.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
stride_height::mlir::IntegerAttr32-bit signed integer attribute
stride_width::mlir::IntegerAttr32-bit signed integer attribute
dilation_height::mlir::IntegerAttr32-bit signed integer attribute
dilation_width::mlir::IntegerAttr32-bit signed integer attribute
groups::mlir::IntegerAttr32-bit signed integer attribute
padding_left::mlir::IntegerAttr32-bit signed integer attribute
padding_right::mlir::IntegerAttr32-bit signed integer attribute
padding_top::mlir::IntegerAttr32-bit signed integer attribute
padding_bottom::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
biasranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.dealloc (tt::ttir::DeallocOp)

Dealloc op.

Tensor Dealloc operation

Operands:

OperandDescription
resultranked tensor of any type values

ttir.div (tt::ttir::DivOp)

Eltwise divide.

Eltwise divide operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.embedding (tt::ttir::EmbeddingOp)

Embedding op.

Embedding operation.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.exp (tt::ttir::ExpOp)

Eltwise exponential op.

Eltwise exponential operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.generic (tt::ttir::GenericOp)

Generically dispatch work to a grid of cores.

This generic op carries a region that represents the work each core does. The region is expected to have the same signature as the op itself. The op is expected to be lowered to a backend specific form by a consuming backend. This op is heavily inspired by the linalg.generic op so it can be useful to refer to linalg.generic documentation for more details.

%5 = "ttir.generic"(%1, %3, %4) <{
  grid = #tt.grid<1x1>,                     // The grid range of cores to dispatch work to.
  indexing_maps = [#map, #map, #map],       // Affine maps for indexing into the input/output tensors. See linalg.generic
  iterator_types = [#parallel, #parallel],  // Iterator types for the input/output tensors. See linalg.generic
  operandSegmentSizes = array<i32: 2, 1>,   // Sizes of the operand segments, i.e. 2 inputs and 1 output.
}> ({
^bb0(%arg2: memref<64x128xf32, #l1_>, %arg3: memref<64x128xf32, #l1_>, %arg4: memref<64x128xf32, #l1_>):
    // Region body, would contain some computation that represents the work each core does.
}) : (tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| grid | ::mlir::tt::GridAttr | TT grid attribute |
| indexing_maps | ::mlir::ArrayAttr | AffineMap array attribute |
| iterator_types | ::mlir::ArrayAttr | |
| operand_constraints | ::mlir::ArrayAttr | |

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.ge (tt::ttir::GreaterEqualOp)

Eltwise greater than or equal to.

Eltwise greater than or equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.kernel (tt::ttir::KernelOp)

Kernel call.

A generic kernel call operation. This operation is used to pattern match by some consuming backend.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
op::mlir::FlatSymbolRefAttrflat symbol reference attribute
kind::mlir::FlatSymbolRefAttrflat symbol reference attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values
outputsvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values

ttir.matmul (tt::ttir::MatmulOp)

Matrix multiply operation.

Matrix multiply operation.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
aranked tensor of any type values
branked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.max (tt::ttir::MaxOp)

Max reduction op.

Max reduction op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.max_pool2d (tt::ttir::MaxPool2dOp)

Applies a 2D max pooling over an input signal composed of several input planes.

Applies a 2D max pooling over an input signal composed of several input planes.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
kernel_height::mlir::IntegerAttr32-bit signed integer attribute
kernel_width::mlir::IntegerAttr32-bit signed integer attribute
stride_height::mlir::IntegerAttr32-bit signed integer attribute
stride_width::mlir::IntegerAttr32-bit signed integer attribute
dilation_height::mlir::IntegerAttr32-bit signed integer attribute
dilation_width::mlir::IntegerAttr32-bit signed integer attribute
ceil_mode::mlir::BoolAttrbool attribute
padding_left::mlir::IntegerAttr32-bit signed integer attribute
padding_right::mlir::IntegerAttr32-bit signed integer attribute
padding_top::mlir::IntegerAttr32-bit signed integer attribute
padding_bottom::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr
original_height::mlir::IntegerAttr32-bit signed integer attribute
original_width::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.maximum (tt::ttir::MaximumOp)

Eltwise maximum OP.

Calculates maximum of input tensors' values element-wise and stores result in output tensor.

Example:

  %lhs: [[3, 2, 7], [1, 4, 4]]
  %rhs: [[1, 4, 2], [1, 2, 3]]
  "ttir.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.mean (tt::ttir::MeanOp)

Mean reduction op.

Mean reduction op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.multiply (tt::ttir::MultiplyOp)

Eltwise multiply.

Eltwise multiply operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface, TTIR_GenericRegionOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.neg (tt::ttir::NegOp)

Eltwise negate op.

Eltwise negate operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.reciprocal (tt::ttir::ReciprocalOp)

Eltwise reciprocal.

Eltwise reciprocal operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.relu (tt::ttir::ReluOp)

Eltwise ReLU.

Eltwise ReLU operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.reshape (tt::ttir::ReshapeOp)

Reshape op.

Reshape tensor.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
shape::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.sigmoid (tt::ttir::SigmoidOp)

Eltwise sigmoid.

Eltwise sigmoid operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.softmax (tt::ttir::SoftmaxOp)

Softmax operation.

Softmax operation.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dimension::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.sqrt (tt::ttir::SqrtOp)

Eltwise square root.

Eltwise square root operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.squeeze (tt::ttir::SqueezeOp)

Squeeze op.

Squeeze tensor.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.subtract (tt::ttir::SubtractOp)

Eltwise subtract.

Eltwise subtract operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface

Attributes:

AttributeMLIR TypeDescription
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttir.sum (tt::ttir::SumOp)

Sum reduction op.

Sum reduction op.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.to_layout (tt::ttir::ToLayoutOp)

Layout op.

ToLayout operation: transitions tensors from one layout to another. Some examples include:

  • Transitioning between different memory spaces, e.g. DRAM to L1.
  • Transitioning between different data types, e.g. f32 to f16.
  • Transitioning between different tile sizes, e.g. 1x16 to 32x32
  • Transitioning between different tensor shardings
  • Some combination of the above
#layout = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.to_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.transpose (tt::ttir::TransposeOp)

Transpose op.

Transpose tensor along two given dimensions.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim0::mlir::IntegerAttr32-bit signed integer attribute
dim1::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.unsqueeze (tt::ttir::UnsqueezeOp)

Unsqueeze op.

Unsqueeze tensor.

Interfaces: DestinationStyleOpInterface, TTIROpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute
operand_constraints::mlir::ArrayAttr

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttir.yield (tt::ttir::YieldOp)

Yield op.

Yield operation; this is required by MLIR to mark the end of a dispatch region.

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

Operands:

OperandDescription
valuesvariadic of ranked tensor of any type values or non-0-ranked.memref of any type values

'ttkernel' Dialect

A TTKernel out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

EthernetConfigAttr

TT EthernetConfig attribute

Syntax:

#ttkernel.ethernet_config<
  EthType,   # eth_type
  NocIndex   # noc_index
>

TT ethernet_config attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| eth_type | EthType | |
| noc_index | NocIndex | |

NocConfigAttr

TT NocConfig attribute

Syntax:

#ttkernel.noc_config<
  NocIndex   # noc_index
>

TT noc_config attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| noc_index | NocIndex | |

TensixConfigAttr

TT TensixConfig attribute

Syntax:

#ttkernel.tensix_config<
  MathFidelity,   # math_fidelity
  bool,   # fp32_dest_acc_en
  bool,   # preserve_fp32_precision
  bool   # math_approx_mode
>

TT compute_desc attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| math_fidelity | MathFidelity | |
| fp32_dest_acc_en | bool | |
| preserve_fp32_precision | bool | |
| math_approx_mode | bool | |

ThreadTypeAttr

TTKernel ThreadTypes

Syntax:

#ttkernel.thread<
  ::mlir::tt::ttkernel::ThreadType   # value
>

Enum cases:

  • noc (Noc)
  • tensix (Tensix)
  • ethernet (Ethernet)

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| value | ::mlir::tt::ttkernel::ThreadType | an enum of type ThreadType |

ttkernel.acquire_dst (tt::ttkernel::AcquireDstOp)

Acquire dest call.

Acquire dest operation

ttkernel.add (tt::ttkernel::AddOp)

Add operation

Add operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.add_tiles_init (tt::ttkernel::AddTilesInitOp)

Short init function

Must be run before add_tiles.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb

ttkernel.add_tiles (tt::ttkernel::AddTilesOp)

Add operation

Performs element-wise addition C=A+B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via acquire_dst call. This call is blocking and is only available on the compute engine.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
in0_tile_index32-bit signless integer
in1_tile_index32-bit signless integer
dst_index32-bit signless integer
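
For context, add_tiles usually appears inside a compute-thread sequence together with the circular-buffer and DST-register ops documented in this section. The ordering below is an illustrative sketch only: type signatures are omitted for brevity, and %c0/%c1 are assumed to be i32 constants 0 and 1.

// Illustrative compute-thread ordering; type signatures omitted for brevity.
"ttkernel.binary_op_init_common"(%in0_cb, %in1_cb, %out_cb)
"ttkernel.add_tiles_init"(%in0_cb, %in1_cb)
"ttkernel.cb_wait_front"(%in0_cb, %c1)
"ttkernel.cb_wait_front"(%in1_cb, %c1)
"ttkernel.tile_regs_acquire"()
"ttkernel.add_tiles"(%in0_cb, %in1_cb, %c0, %c0, %c0)   // C = A + B into DST[0]
"ttkernel.tile_regs_commit"()
"ttkernel.tile_regs_wait"()
"ttkernel.cb_reserve_back"(%out_cb, %c1)
"ttkernel.pack_tile"(%c0, %out_cb, %c0)                  // DST[0] -> out_cb
"ttkernel.cb_push_back"(%out_cb, %c1)
"ttkernel.tile_regs_release"()
"ttkernel.cb_pop_front"(%in0_cb, %c1)
"ttkernel.cb_pop_front"(%in1_cb, %c1)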

ttkernel.binary_op_init_common (tt::ttkernel::BinaryOpInitCommonOp)

Init function for all binary ops

Followed by the specific init required with an opcode (binary_op_specific_init).

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
out_cbTTKernel cb

ttkernel.builtin (tt::ttkernel::BuiltinOp)

Builtin call.

Kernel operation

Attributes:

AttributeMLIR TypeDescription
op::mlir::FlatSymbolRefAttrflat symbol reference attribute
kind::mlir::FlatSymbolRefAttrflat symbol reference attribute

Operands:

OperandDescription
argsvariadic of non-0-ranked.memref of any type values or TTKernel cb

ttkernel.cb_pop_front (tt::ttkernel::CBPopFrontOp)

CBPopFront call.

CBPopFront operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_push_back (tt::ttkernel::CBPushBackOp)

CBPushBack call.

CBPushBack operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_reserve_back (tt::ttkernel::CBReserveBackOp)

CBReserveBack call.

CBReserveBack operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.cb_wait_front (tt::ttkernel::CBWaitFrontOp)

CBWaitFront call.

CBWaitFront operation

Operands:

OperandDescription
cbTTKernel cb
numPages32-bit signless integer

ttkernel.get_noc_addr (tt::ttkernel::GetNocAddrOp)

GetNocAddr

GetNocAddr

Interfaces: InferTypeOpInterface

Operands:

OperandDescription
x32-bit signless integer
y32-bit signless integer
l1Address32-bit signless integer

Results:

ResultDescription
nocAddrTTKernel noc address

ttkernel.matmul (tt::ttkernel::MatmulOp)

Matmul operation

Matmul operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.mul (tt::ttkernel::MulOp)

Mul operation

Mul operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.mul_tiles_init (tt::ttkernel::MulTilesInitOp)

Short init function

Must be run before mul_tiles.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb

ttkernel.mul_tiles (tt::ttkernel::MulTilesOp)

Mul operation

Performs element-wise multiplication C=A*B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via acquire_dst call. This call is blocking and is only available on the compute engine.

Operands:

OperandDescription
in0_cbTTKernel cb
in1_cbTTKernel cb
in0_tile_index32-bit signless integer
in1_tile_index32-bit signless integer
dst_index32-bit signless integer

ttkernel.noc_async_read_barrier (tt::ttkernel::NocAsyncReadBarrierOp)

NocAsyncReadBarrier

NocAsyncReadBarrier

ttkernel.noc_async_read (tt::ttkernel::NocAsyncReadOp)

NocAsyncRead

NocAsyncRead

Operands:

OperandDescription
srcNocAddrTTKernel noc address
dstLocalL1Addr32-bit signless integer
size32-bit signless integer

ttkernel.noc_async_write_barrier (tt::ttkernel::NocAsyncWriteBarrierOp)

NocAsyncWriteBarrier

NocAsyncWriteBarrier

ttkernel.noc_async_write (tt::ttkernel::NocAsyncWriteOp)

NocAsyncWrite

NocAsyncWrite

Operands:

OperandDescription
srcLocalL1Addr32-bit signless integer
dstNocAddrTTKernel noc address
size32-bit signless integer

ttkernel.pack (tt::ttkernel::PackOp)

Pack op.

Pack operation

Operands:

OperandDescription
dst_index32-bit signless integer
out_cbTTKernel cb
out_index32-bit signless integer

ttkernel.pack_set_data_type (tt::ttkernel::PackSetDataTypeOp)

Pack set DataType op.

Pack set DataType operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| data_type | ::mlir::IntegerAttr | TT DataTypes (enum cases: f32, f16, bf16, bfp_f8, bfp_bf8, bfp_f4, bfp_bf4, bfp_f2, bfp_bf2, u32, u16, u8) |

ttkernel.pack_tile (tt::ttkernel::PackTileOp)

PackTile op.

Copies a single tile from the DST register buffer at a specified index to a specified CB at a given index. For the out_tile_index to be valid for this call, cb_reserve_back(n) has to be called first to reserve at least some number n > 0 of tiles in the output CB. out_tile_index = 0 then references the first tile in the reserved section of the CB, up to index n - 1, which will then be visible to the consumer in the same order after a cb_push_back call. The DST register buffer must be in acquired state via acquire_dst call. This call is blocking and is only available on the compute engine.

Each subsequent pack call will increment the write pointer in the CB by a single tile size. The pointer is then again set to a valid position with space for n reserved tiles by another cb_reserve_back call.

Operates in tandem with functions cb_reserve_back and cb_push_back.

A typical use case: the producer first ensures that a number of tiles is available in the buffer via cb_reserve_back, then uses the pack_tile call to copy a tile from one of the DST slots to a slot in the reserved space, and finally calls cb_push_back to announce the visibility of the reserved section of the circular buffer to the consumer.

Operands:

OperandDescription
dst_index32-bit signless integer
out_cbTTKernel cb
out_index32-bit signless integer

ttkernel.release_dst (tt::ttkernel::ReleaseDstOp)

Release dest call.

Release dest operation

ttkernel.return (tt::ttkernel::ReturnOp)

Return op.

Return operation

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

ttkernel.sub (tt::ttkernel::SubOp)

Sub operation

Sub operation

Operands:

OperandDescription
dst_index32-bit signless integer

ttkernel.tile_regs_acquire (tt::ttkernel::TileRegsAcquireOp)

Tile_regs_acquire

Acquire an exclusive lock on the DST register for the MATH thread. This register is an array of 16 tiles of 32x32 elements each. This is a blocking function, i.e. this function will wait until the lock is acquired.

ttkernel.tile_regs_commit (tt::ttkernel::TileRegsCommitOp)

Tile_regs_commit

Release lock on DST register by MATH thread. The lock had to be previously acquired with tile_regs_acquire.

ttkernel.tile_regs_release (tt::ttkernel::TileRegsReleaseOp)

Tile_regs_release

Release lock on DST register by PACK thread. The lock had to be previously acquired with tile_regs_wait.

ttkernel.tile_regs_wait (tt::ttkernel::TileRegsWaitOp)

Tile_regs_wait

Acquire an exclusive lock on the DST register for the PACK thread. It waits for the MATH thread to commit the DST register. This is a blocking function, i.e. this function will wait until the lock is acquired.

ttkernel.tilize_block (tt::ttkernel::TilizeBlockOp)

TilizeBlockOp call.

TilizeBlockOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.tilize_init (tt::ttkernel::TilizeInitOp)

TilizeInitOp call.

TilizeInitOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.unpack_ab (tt::ttkernel::UnpackABOp)

UnpackAB op.

UnpackAB operation

Operands:

OperandDescription
cb_aTTKernel cb
src_a_index32-bit signless integer
cb_bTTKernel cb
src_b_index32-bit signless integer

ttkernel.unpack_a (tt::ttkernel::UnpackAOp)

UnpackA op.

UnpackA operation

Operands:

OperandDescription
cbTTKernel cb
src_index32-bit signless integer

ttkernel.unpack_set_data_type (tt::ttkernel::UnpackSetDataTypeOp)

Unpack set DataType op.

Unpack set DataType operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| data_type_a | ::mlir::IntegerAttr | TT DataTypes (enum cases: f32, f16, bf16, bfp_f8, bfp_bf8, bfp_f4, bfp_bf4, bfp_f2, bfp_bf2, u32, u16, u8) |
| data_type_b | ::mlir::IntegerAttr | TT DataTypes (enum cases: f32, f16, bf16, bfp_f8, bfp_bf8, bfp_f4, bfp_bf4, bfp_f2, bfp_bf2, u32, u16, u8) |

ttkernel.unreachable (tt::ttkernel::UnreachableOp)

Unreachable op.

Unreachable operation

Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator

Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface

Effects: MemoryEffects::Effect{}

ttkernel.untilize_block (tt::ttkernel::UntilizeBlockOp)

UntilizeBlockOp call.

UntilizeBlockOp operation

Operands:

OperandDescription
cbInTTKernel cb
numTiles32-bit signless integer
cbOutTTKernel cb

ttkernel.untilize_init (tt::ttkernel::UntilizeInitOp)

UntilizeInitOp call.

UntilizeInitOp operation

Operands:

OperandDescription
cbInTTKernel cb
cbOutTTKernel cb

CBType

TTKernel cb

Syntax:

!ttkernel.cb<
  CBPort,   # port
  uint64_t,   # address
  MemRefType,   # memref
  uint64_t,   # page_size
  uint64_t   # num_buffers
>

Circular buffer type in TTKernel dialect

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| port | CBPort | |
| address | uint64_t | |
| memref | MemRefType | |
| page_size | uint64_t | |
| num_buffers | uint64_t | |

NocAddrType

TTKernel noc address

Syntax: !ttkernel.noc_addr

Noc address type in TTKernel dialect

'ttmetal' Dialect

A TTMetal out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

CoreRangeAttr

TTMetal grid attribute

Syntax:

#ttmetal.core_range<
  ::llvm::ArrayRef<int64_t>,   # offset
  ::llvm::ArrayRef<int64_t>   # size
>

TTMetal grid attribute

Parameters:

| Parameter | C++ type | Description |
| --- | --- | --- |
| offset | ::llvm::ArrayRef<int64_t> | |
| size | ::llvm::ArrayRef<int64_t> | |

ttmetal.alloc (tt::ttmetal::AllocOp)

Alloc op.

Tensor Alloc operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace (enum cases: system, mmio, dram, l1) |

Results:

ResultDescription
resultranked tensor of any type values

ttmetal.dealloc (tt::ttmetal::DeallocOp)

Dealloc op.

Tensor Dealloc operation

Operands:

OperandDescription
inputranked tensor of any type values

ttmetal.dispatch (tt::ttmetal::DispatchOp)

Dispatch op.

Dispatch operation

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
core_ranges::mlir::ArrayAttr
kernelConfigs::mlir::ArrayAttr

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttmetal.host_read (tt::ttmetal::HostReadOp)

Host read op.

Host read operation

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttmetal.host_write (tt::ttmetal::HostWriteOp)

Host write op.

Host write operation

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

'ttnn' Dialect

A TTNN out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.

[TOC]

ttnn.abs (tt::ttnn::AbsOp)

Eltwise absolute.

Eltwise absolute operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.add (tt::ttnn::AddOp)

Eltwise add.

Eltwise add operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.alloc (tt::ttnn::AllocOp)

Alloc op.

Tensor Alloc operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
| memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace (enum cases: system, mmio, dram, l1) |

Results:

ResultDescription
resultranked tensor of any type values

ttnn.concat (tt::ttnn::ConcatOp)

Concat op.

Concat tensors along a given dimension.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
dim::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.conv2d (tt::ttnn::Conv2dOp)

Conv2d operation.

Applies a 2D convolution over an input image composed of several input planes.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
in_channels::mlir::IntegerAttr32-bit signless integer attribute
out_channels::mlir::IntegerAttr32-bit signless integer attribute
batch_size::mlir::IntegerAttr32-bit signless integer attribute
input_height::mlir::IntegerAttr32-bit signless integer attribute
input_width::mlir::IntegerAttr32-bit signless integer attribute
kernel_height::mlir::IntegerAttr32-bit signless integer attribute
kernel_width::mlir::IntegerAttr32-bit signless integer attribute
stride_height::mlir::IntegerAttr32-bit signless integer attribute
stride_width::mlir::IntegerAttr32-bit signless integer attribute
padding_height::mlir::IntegerAttr32-bit signless integer attribute
padding_width::mlir::IntegerAttr32-bit signless integer attribute
dilation_height::mlir::IntegerAttr32-bit signless integer attribute
dilation_width::mlir::IntegerAttr32-bit signless integer attribute
groups::mlir::IntegerAttr32-bit signless integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
biasranked tensor of any type values
outputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.dealloc (tt::ttnn::DeallocOp)

Dealloc op.

Tensor Dealloc operation

Operands:

OperandDescription
inputranked tensor of any type values

ttnn.div (tt::ttnn::DivOp)

Eltwise divide.

Eltwise divide operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.embedding (tt::ttnn::EmbeddingOp)

Embedding op.

Embedding operation.

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputranked tensor of any type values
weightranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.empty (tt::ttnn::EmptyOp)

Empty op.

Tensor empty operation

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| shape | ::mlir::tt::ttnn::ShapeAttr | TTNN shape attribute |
| dtype | ::mlir::tt::DataTypeAttr | TT DataTypes (enum cases: f32, f16, bf16, bfp_f8, bfp_bf8, bfp_f4, bfp_bf4, bfp_f2, bfp_bf2, u32, u16, u8) |
| layout | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout (enum cases: row_major, tile, invalid) |
| memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN memory config attribute |

Operands:

OperandDescription
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.exp (tt::ttnn::ExpOp)

Eltwise exponential.

Eltwise exponential operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.full (tt::ttnn::FullOp)

Full op.

Tensor full operation

Attributes:

AttributeMLIR TypeDescription
fillValue::mlir::FloatAttr32-bit float attribute

Operands:

OperandDescription
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.get_device (tt::ttnn::GetDeviceOp)

Get Device op.

This op returns the current runtime device.

Results:

ResultDescription
deviceTT device
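
Taken together, get_device, to_device, and to_layout form the typical preamble for running TTNN ops on a device. The sketch below is illustrative only: the #device alias, the #ttnn.layout<tile> spelling, and the omission of the required memory_config attributes are assumptions made for brevity, not compiler output.

%dev = "ttnn.get_device"() : () -> !tt.device<#device>
// Move a host tensor onto the device (memory_config attribute elided here).
%x = "ttnn.to_device"(%arg0, %dev) : (tensor<64x128xf32>, !tt.device<#device>) -> tensor<64x128xf32>
// Convert to the tile layout expected by most compute ops.
%y = "ttnn.to_layout"(%x, %dev) <{layout = #ttnn.layout<tile>}> : (tensor<64x128xf32>, !tt.device<#device>) -> tensor<64x128xf32>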

ttnn.ge (tt::ttnn::GreaterEqualOp)

Eltwise greater than or equal to.

Eltwise greater than or equal to operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.matmul (tt::ttnn::MatmulOp)

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
aranked tensor of any type values
branked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.max (tt::ttnn::MaxOp)

Max reduction op.

Max reduction op.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.max_pool2d (tt::ttnn::MaxPool2dOp)

Applies a 2D max pooling over an input signal composed of several input planes.

Applies a 2D max pooling over an input signal composed of several input planes.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
batch_size::mlir::IntegerAttr32-bit signed integer attribute
input_height::mlir::IntegerAttr32-bit signed integer attribute
input_width::mlir::IntegerAttr32-bit signed integer attribute
channels::mlir::IntegerAttr32-bit signed integer attribute
kernel_height::mlir::IntegerAttr32-bit signed integer attribute
kernel_width::mlir::IntegerAttr32-bit signed integer attribute
stride_height::mlir::IntegerAttr32-bit signed integer attribute
stride_width::mlir::IntegerAttr32-bit signed integer attribute
dilation_height::mlir::IntegerAttr32-bit signed integer attribute
dilation_width::mlir::IntegerAttr32-bit signed integer attribute
ceil_mode::mlir::BoolAttrbool attribute
padding_height::mlir::IntegerAttr32-bit signed integer attribute
padding_width::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.maximum (tt::ttnn::MaximumOp)

Eltwise maximum OP.

Calculates maximum of input tensors' values element-wise and stores result in output tensor.

Example:

  %lhs: [[3, 2, 7], [1, 4, 4]]
  %rhs: [[1, 4, 2], [1, 2, 3]]
  "ttnn.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.mean (tt::ttnn::MeanOp)

Mean reduction op.

Mean reduction op.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.multiply (tt::ttnn::MultiplyOp)

Eltwise multiply.

Eltwise multiply operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.neg (tt::ttnn::NegOp)

Eltwise negate.

Eltwise negate operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.reciprocal (tt::ttnn::ReciprocalOp)

Eltwise reciprocal.

Eltwise reciprocal operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.relu (tt::ttnn::ReluOp)

Eltwise ReLU.

Eltwise ReLU operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.reshape (tt::ttnn::ReshapeOp)

Reshape op.

Reshape tensor.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
shape::mlir::ArrayAttr32-bit integer array attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.sigmoid (tt::ttnn::SigmoidOp)

Eltwise sigmoid.

Eltwise sigmoid operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.softmax (tt::ttnn::SoftmaxOp)

Softmax op.

Softmax operation.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
dimension::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.sqrt (tt::ttnn::SqrtOp)

Eltwise sqrt.

Eltwise sqrt operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.subtract (tt::ttnn::SubtractOp)

Eltwise subtract.

Eltwise subtract operation.

Traits: AttrSizedOperandSegments

Interfaces: DestinationStyleOpInterface

Operands:

OperandDescription
inputsvariadic of ranked tensor of any type values
outputsvariadic of ranked tensor of any type values

Results:

ResultDescription
resultsvariadic of ranked tensor of any type values

ttnn.sum (tt::ttnn::SumOp)

Sum reduction op.

Sum reduction op.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
keep_dim::mlir::BoolAttrbool attribute
dim_arg::mlir::ArrayAttr32-bit integer array attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values

ttnn.to_device (tt::ttnn::ToDeviceOp)

ToDevice op.

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN memory config attribute |

Operands:

OperandDescription
inputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.to_layout (tt::ttnn::ToLayoutOp)

ToLayout op.

Attributes:

| Attribute | MLIR Type | Description |
| --- | --- | --- |
| layout | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout (enum cases: row_major, tile, invalid) |

Operands:

OperandDescription
inputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.to_memory_config (tt::ttnn::ToMemoryConfigOp)

ToMemoryConfig op.

Operands:

OperandDescription
inputranked tensor of any type values
deviceTT device

Results:

ResultDescription
resultranked tensor of any type values

ttnn.transpose (tt::ttnn::TransposeOp)

Transpose op.

Transpose tensor along two given dimensions.

Interfaces: DestinationStyleOpInterface

Attributes:

AttributeMLIR TypeDescription
dim0::mlir::IntegerAttr32-bit signed integer attribute
dim1::mlir::IntegerAttr32-bit signed integer attribute

Operands:

OperandDescription
inputranked tensor of any type values
outputranked tensor of any type values

Results:

ResultDescription
resultranked tensor of any type values