Introduction
The following document provides an overview of the TT-MLIR project, with a focus on the technical specifications of an MLIR-based compiler stack. So what exactly is an MLIR-based compiler stack? MLIR (Multi Level Intermediate Representation) is a subproject coming out of the LLVM Project. It seeks to introduce extensibility and sustainable code design to a very modular compiler framework. This essentially means to take a much larger more involved compiler (like LLVM) and split it into sub-compilers that each produce their own Intermediate Representation (IR) of what you've fed the compiler.
Disclaimer: This is intended to be a working document, if you find something incorrect or incomplete please feel free to create a PR.
Motivations
The idea of having a multi-level IR might not seem so far fetched, in fact it resembles some of our current software stacks. The idea of going from a High Level TVM Graph → Lowered PyBUDA Graph → Netlist, with each layer having their own level of optimizations is quite a familiar concept. However, there are problems with the reusability and integration of optimizations for the current software compiler stack. Currently, users are almost forced to choose between a top-down optimization or bottom-up optimization, with both requiring "expert-level" expertise to optimize for desired performance. Developing 2 entirely different projects is taxing, and it's hard to translate the benefits of BUDA over to metal (or the other way around). One of the primary goals of tt-mlir is to enable a consistent programming model between software stacks, concepts for improving optimizations in the compiler stack should 1-1 carry over to hand-written TTNN.
The benefits grow even further when one can understand all the possible entry points that multiple IRs present. Existing MLIR based projects like OpenXLA and torch-mlir can natively output MLIR in a dialect that can be transcribed to the TTIR dialect as well!
What is MLIR and why use it?
MLIR is a compiler infrastructure that is designed to be modular and extensible. The main benefits the tt-mlir project hopes to gain by using MLIR include:
- Industry Standard Compiler Framework
- Lots of boilerplate algorithms, data structures, and useful software that is common to compiler development
- Ecosystem
- Hook into existing front-end MLIR projects
- Testing framework
- A battle-tested test infrastructure that will enable us to write fine grained tests and rely less on end-to-end testing
- Common IR Serialization Format that's easy to test, debug, and edit
Additional documentation to highlight the benefits of MLIR can be found here:
MLIR: Overview
MLIR is at it's root an interpreter that can parse "readable" text in some .mlir format. The unique properties lie in the modularity of the parsing itself. MLIR is built upon a collection of Dialects, each of these Dialects define a collection of Operations, Types, and Attributes. These dialects follow their own syntax, and they can encode any amount of information. The benefit is that MLIR provides bindings and hooks such that a user can directly translate these IRs into usable artifacts for that layer of complexity. An example of this would be the relatively high level TOSA Dialect, which is used to represent computation over tensors, and then lowering that to a more hardware specific dialect that closely models the programming model of the hardware or underlying backend. It is the dialect system itself which powers the multi-level functionality of MLIR, with different dialects a user can essentially "lower" through their software stack by just transforming between the different dialects for their layers. Dialects can exist in a broad range from purely mathematical dialects, to a LinAlg Dialect, or a Tensorflow Dialect defined for ML Graphs. Each dialect encodes it's own information and their operations can use the Types/Attributes of other dialects as parameters. Multiple dialects are possible in one module, and encouraged to highlight optimizations of different dialects. In our usecase for the TT Stack, MLIR acts a "mid-level" compiler which makes the task of joining together various entry points and backends much simpler.
MLIR Primitives
So what does MLIR look like, how does it work and get parsed? The hierarchy of an MLIR Module is as shown:
#any_device = #tt.operand_constraint<dram|l1|scalar|tile|any_device|any_device_tile>
module attributes {tt.system_desc = #tt.system_desc<[<#tt.arch<wormhole_b0>, #tt.grid<8x8>>], [0], [<pcie|host_mmio>], [<0, 0, 0, 0>]>} {
func.func @forward(%arg0: tensor<64x128xf32>, %arg1: tensor<64x128xf32>) -> tensor<64x128xf32> {
%0 = tensor.empty() : tensor<64x128xf32>
%1 = "ttir.multiply"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>, operand_constraints = [#any_device, #any_device, #any_device]}> : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
return %1 : tensor<64x128xf32>
}
}
-
Attributes (defined using #)
- The syntax of actually creating an attribute is modular, and custom assembly instructions for different attributes can be applied.
-
Operations
-
These operations are accessed with the . method, so you'll see some examples like func.func or tensor.empty. Each operation also provides it's own assembly instructions but often strictly defines the type of result
-
Quotes are added around ttir.multiply since it's part of a custom dialect, and more custom assembly instructions are applied to specify operand_constraints.
-
Operations typically have operands (arguments) and results which are highlighted with %, these results and operands help to show the relationship between operations
-
-
Types
-
Types are shown as dataformats throughout this compiled mlir module, where tensor and array are some examples.
-
They help to demonstrate the transformation of information and it's representation as it's processed across this module.
-
MLIR Workflow
The overall MLIR workflow doesn't involve writing .mlir files, not necessarily even modifying them. The Intermediate Representations are truly just representations, we can parse them to demonstrate what the graph looks like at that current stage of optimization, or run a pass through them to optimize certain functions. The overall framework is designed with the following architecture in mind:
-
Graph Information exists
-
Graph Information is transformed (through any which method) into a high-level MLIR representation
-
Passes are run on the high-level implementation to lower into TTIR, a common IR that can be lowered into multiple backends
-
Depending on the usecase more passes are run to lower to whatever backend the user would like (ex: TTNN Backend)
What are Passes?
Transformations in MLIR are represented as passes that occur during the parsing of some information. These passes can be executed when parsing or generating MLIR modules. These transformations can have a myriad of purposes, and are completely user defined as to how they modify the module. Some examples of passes can be for lowering purposes as mentioned before, where a dialect is parsed and then each operation is transformed to a lowered dialect following some set of user defined rules. Passes are also used for optimizations and backend code transformation in the context of this project. They're a powerful tool and provide most of the functionality to transform between layers of dialects, and they provide a simple platform for modifications of an MLIR module.
Why not make our own?
Now that I've described the functionality of the MLIR framework, it seems like making an in house multi level Intermediate Representation system would be pretty similar, so why are we going through the effort of implementing this framework?
One of the biggest reason can be attributed to the active developer community surrounding the project, being a part of the LLVM Project means that there is solid developer support, and the framework is designed to be a tool for many different paradigms of compute. This scalability and strong mission statement lend to the strengths of MLIR being a solid platform to use as a middle layer in our compiler stack. Furthermore, as a functional benefit of being part of a larger open source project, MLIR has a whole library of tests and infrastructure that we can leverage for solid code health while starting a new project.
Automation
It's not only about developer support, another key benefit of MLIR is that it's built with autogeneration in mind. Through TableGen a lot of the boilerplate of creating this multi-level IR become abstracted away to truly focus on implementation and execution. This automation is built on top of a pre-existing robust framework with a lot of implementations and support from other large players in the ML scene. By integrating with these automation pipelines, we allow for external developers to have a much simpler entry-point into our software stack!
TT-MLIR: Bringing MLIR to the TT Stack
Now that we have defined this pretty cool project, let's look at the implementation details of bringing MLIR (and related optimizations) into the TT Stack. Since it acts as a mid-level compiler we can start by defining the "bottom" and "top" layers of the compiler. BUDA already has a well defined set of frontend optimizations to some TVM defined graph and is knowledgeable of the hardware that these models want to run on. We want to interrupt the BUDA stack to only give us the frontend compiled graph before any hardware specific lowering is to occur. What this will produce is information that is agnostic to different backends and their execution on TT hardware, but this is still valid information to optimize at different levels for later compilation. The "bottom" of our graph is now defined as the backend that will produce the machine-specific code to be executed. While MLIR could allow for any level of complexity downwards for the bottom, we will define a very aggressive TTNN backend for the MVP. Desired Optimization List:
-
Forge-FE (frontend)
- Graph Optimizations, Constant Folding, Operation Fusion
-
TT-MLIR (mid-level)
- Data Storage, Memory Configuration, Grid Configuration
-
TT-NN (backend)
- Kernel Configuration*, Network Optimization
*Subject to Change / Be Moved to TT-MLIR
TT-MLIR Dialects
Now that we have defined the series of optimizations that we would like to see implemented in TT-MLIR, we can begin to help define the dialects that would help to support these different levels of optimizations. For more detail on each of these dialects, please refer to the GitHub Wiki and TableGen descriptors. I think that Nick does a great job of documenting the key functionality.
TT Dialect
The TT Dialect is only for common Types and Attributes used throughout the many levels of the mid level compiler.
TTIR Dialect
The TTIR Dialect is defined as the common dialect for TT-MLIR, as such it doesn't define anything hardware/backend specific. It lists out general actions that would take place on TT hardware such as dispatch, layout, and kernel operations.
Generic Operation
This is one of two operations that's crucial to understand the intended optimization characteristics of the TTIR Dialect. The generic operation dictates the actions that would be taken to dispatch some instruction to TT hardware such that it executes some instruction. Parametrically, the operation consumes inputs, outputs, maps to read the tensors, and access-types to the memory. These parameters highlight the optimizations that can be performed at this level to change the location of the memory, transpose using variant access maps, or even the grid upon which the computation takes place. The operation also contains a block in which the exact behaviour for that operation to occur is stored.
Layout Operation
The layout operation is key in describing the storage of memory throughout the execution graph. Layout determines the sharding spec, location of the memory, data types, and tile sizes of some tensor. While generic describes the dispatch for some data-wise transformation to take place, the data itself is laid out across the chip through the layout operation.
Both of these operations describe the key functionality of the TTIR dialect and the optimization space that it provides.
Built-in MLIR Dialects
The functionality of TT-MLIR Dialects also depends / is inspired by the functionality of Built-in MLIR Dialects like Affine and LinAlg. Below are summaries of some of the key members of these Dialects
Affine Dialect
[Reference] Affine maps help to describe transformations on coordinate systems, while this may not really make sense, imagine trying to index a rank 2 tensor. By getting t[x, y] I can access the element in the Xth row and Yth column, but if I wanted to transpose the tensor I might have to re-layout the entire tensor such that the data would be accessible using t[x, y] to get the element in the Yth row and Xth column. This transpose can also be represented using an Affine Map to transform (x, y) -> (y, x) and this would let the tensor data remain in place while the access method is modified. This extends even further to more complex transformations such that stride lengths or unique indexing methods can be implemented without complicated manipulation.
Tensor Dialect
[Reference] The tensor dialect defines the functionality and Type of the fundamental Tensor. This dialect contains members that would represent manipulation and representation of tensors as multi-dimensional data with shapes and datatypes. Not much else is different about this dialect, the reference covers key topics if implementation details are needed.
Func Dialect
TOSA Dialect
SCF Dialect
EmitC Dialect
TT-Explorer - Performance Optimization Tool
A unique project related to TT-MLIR is the integration of Performance Optimization Tools such that users are easily able to visualize and readily tune their models without needing an expert level understanding of the tech stack. TT-Explorer is built with Google AI's Model Explorer as a base for the visualization tool, and a custom adapter to parse TT-MLIR projects. This would allow users to readily tune their models, and optimize for the TTIR layer (ex: they can change certain memory to be laid out in L1 instead of DRAM, or change the grid layout of an operation to be larger than what was previously assigned). After compilation with these overrides, the runtime information can then be fed directly into a Tracy Performance Analysis for the user to visualize the impacts of their tuning, seeing which operations were least performant and continuing in a gamified design loop for iterative performance tuning!
Building
These are the steps required to get the TT-MLIR project running on your machine
Please refer to the Dependencies section before building the project.
Environment setup
You only need to build this once, it builds llvm, flatbuffers and a python virtual environment.
cmake -B env/build env
cmake --build env/build
- It is recommended to use the system installation of python3 for the virtual environment. Please ensure that you do not already have a venv activated before running the above command.
- Please ensure the directory
/opt/ttmlir-toolchain
exist and its owner is the current user, i.e. the one that executes the abovecmake
commands. The commands create it and assign the proper ownership are:sudo mkdir -p /opt/ttmlir-toolchain sudo chown -R $USER /opt/ttmlir-toolchain
Build
source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17
cmake --build build
- To enable the ttnn/metal runtime add
-DTTMLIR_ENABLE_RUNTIME=ON
- Clang 17 is the minimum required version when enabling the runtime.
- To enable the ttnn/metal perf runtime add
-DTT_RUNTIME_ENABLE_PERF_TRACE=ON
- To accelerate the builds with ccache use
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache
- To accelerate builds further, if python bindings aren't needed,
-DTTMLIR_ENABLE_BINDINGS_PYTHON=OFF
. For some reason the python bindings link step is very slow.- TTNN build is automatically integrated / handled by tt-mlir cmake build system. For debugging and further information regarding the TTNN backend build step, please refer to TTNN Documentation.
- The runtime build step depends on the
ARCH_NAME
environment variable, which is set in theenv/activate
script. If you want to build the runtime for a different architecture, please setARCH_NAME
to the desired value (one ofgrayskull
,wormhole_b0
, orblackhole
). Please note that the runtime is built only ifTTMLIR_ENABLE_RUNTIME=ON
.- In addition to
ARCH_NAME
, the runtime build depends onTT_METAL_HOME
variable, which is also set inenv/activate
script. For more information, please refer to TT-NN and TT-Metailium installation documentation.
OS | Offline Compiler Only | Runtime Enabled Build | Runtime + Perf Enabled Build |
---|---|---|---|
Ubuntu 22.04 | ✅ | ✅ | ✅ |
Ubuntu 20.04 | ✅ | ❌ | ❌ |
MacOS | ✅ | ❌ | ❌ |
Test
source env/activate
cmake --build build -- check-ttmlir
llvm-lit
Under the hood the check-ttmlir cmake target is running llvm-lit
. With it you
can:
# Query which tests are available
llvm-lit -sv ./build/test --show-tests
# Run an individual test:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR/test_allocate.mlir
# Run a sub-suite:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR
See the full llvm-lit documentation for more information.
Lint
source env/activate
cmake --build build -- clang-tidy
Pre-Commit
Pre-Commit applies a git hook to the local repository such that linting is checked and applied on every git commit
action. Install from the root of the repository using:
source env/activate
pre-commit install
If you have already committed before installing the pre-commit hooks, you can run on all files to "catch up":
pre-commit run --all-files
For more information visit pre-commit
Docs
source env/activate
cmake --build build -- docs
mdbook serve build/docs
mdbook
can be installed with the system's package manager.mdbook serve
will by default create a local server athttp://localhost:3000
.
Note: If you want to build the docs on MacOS, there are two extra dependencies:
Both can be installed using Homebrew by running the following commands:
brew install doxygen
brew install graphviz
Dependencies
Ubuntu Common
Make sure to have Git LFS installed. You can install it with the following command:
sudo apt-get install git-lfs
Ubuntu 22.04
On Ubuntu 22.04 we need to install clang, ninja, and to update the version of cmake because 3.20 is the minimum required for this project.
sudo apt update
sudo apt upgrade
sudo apt install clang-17
sudo apt install ninja-build
sudo apt remove cmake -y
pip3 install cmake --upgrade
hash -r
Ensure cmake can by found in this path pip installed it to. E.g.
PATH=$PATH:$HOME/.local/bin
Then run the following command to see the cmake version which should be later than 3.20
cmake --version
We also need to install Ninja which can be done with the following command
sudo apt install ninja-build
MacOS
On MacOS we need to install the latest version of cmake, and ninja which can be done using Homebrew with (Docs for installing Homebrew: https://brew.sh).
brew install cmake
brew install ninja
Common Build Errors
TTMLIRPythonCAPI target requires changing an RPATH
CMake Error at /opt/ttmlir-toolchain/lib/cmake/llvm/AddLLVM.cmake:594 (add_library):
The install of the TTMLIRPythonCAPI target requires changing an RPATH from
the build tree, but this is not supported with the Ninja generator unless
on an ELF-based or XCOFF-based platform. The
CMAKE_BUILD_WITH_INSTALL_RPATH variable may be set to avoid this relinking
step.
If you get the above error, it means you tried to build with an old version of cmake or ninja and there is a stale file. To fix this, rm -rf
your build directory, install a newer version of cmake/ninja, and then rebuild. If you installed ninja via sudo apt install ninja-build
, it might still be not up-to-date (v1.10.0). You may use ninja in the python virtual environment, or install it via pip3 install -U ninja
, either way the version 1.11.1.git.kitware.jobserver-1
should work.
clang++ is not a full path and was not found in the PATH
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_CXX_COMPILER:
clang++
is not a full path and was not found in the PATH.
Tell CMake where to find the compiler by setting either the environment
variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_C_COMPILER:
clang
is not a full path and was not found in the PATH.
Tell CMake where to find the compiler by setting either the environment
variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
the compiler, or to the compiler name if it is in the PATH.
If you get the following error, it means you need to install clang which you can do with sudo apt install clang
on Ubuntu.
sfpi
, trisc
, ncrisc
build failure
tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 1: version: not found
tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 2: oid: not found
size: '1961632': No such file
size: '1961632': No such file
size: '1961632': No such file
Always | FATAL | ncrisc build failed
If you got the above error, it means that SFPI or similar component build failed. First, make sure you have GIT LFS setup (e.g. sudo apt-get install git-lfs). Then, try to pull SFPI submodule manually:
cd third_party/tt-metal/src/tt-metal
git submodule foreach 'git lfs fetch --all && git lfs pull'
Then, try to build again.
Common Runtime Errors
Debugging python on macOS
When debugging python on macOS via lldb you may see an error like:
(lldb) r
error: process exited with status -1 (attach failed (Not allowed to attach to process. Look in the console messages (Console.app), near the debugserver entries, when the attach failed. The subsystem that denied t
he attach permission will likely have logged an informative message about why it was denied.))
For preinstalled macOS binaries you must manually codesign with debug entitlements.
Create file debuggee-entitlement.xml
:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>com.apple.security.cs.disable-library-validation</key>
<true/>
<key>com.apple.security.get-task-allow</key>
<true/>
</dict>
</plist>
Sign the binary:
sudo codesign -f -s - --entitlements debuggee-entitlement.xml /opt/ttmlir-toolchain/venv/bin/python
Internal Build Notes / IRD
- When building the runtime we must use Ubuntu 22.04 docker image
- When making an IRD reservation use
--docker-image yyz-gitlab.local.tenstorrent.com:5005/tenstorrent/infra/ird-ubuntu-22-04-amd64:latest
- When making an IRD reservation use
- You'll have to manaully install a newer version of cmake, at least 3.22, the easiest way to do this is to
pip install cmake
and make sure this one is in your path - You'll want LLVM installation to persist IRD reservations, you can achieve this by:
- mkdir /localdev/$USER/ttmlir-toolchain
- When requesting an IRD use
--volumes /localdev/$USER/ttmlir-toolchain:/opt/ttmlir-toolchain
Working with Docker Images
Components:
- Dockerfile
- Workflow for building Docker image
- Project build using Docker image
Overview
We use docker images to prepare project enviroment, install dependancies, tooling and prebuild toolchain. Project builds four docker images:
Base image tt-mlir-base-ubuntu-22-04
Dockerfile.base
CI image tt-mlir-ci-ubuntu-22-04
Dockerfile.ci
Base IRD image tt-mlir-base-ird-ubuntu-22-04
Dockerfile.ird
IRD image tt-mlir-ird-ubuntu-22-04
Dockerfile.ird
Base image starts with a supported base image (Ubuntu 22.04) and installs dependancies for project build. From there we build CI image that contains prebuild toolcahin and is used in CI to shoten the build time. IRD image contain dev tools like GDB, vim etc and shh and are use in IRD enviroments.
During the CI Docker build, the project is built and tests are run to ensure that everything is set up correctly. If any dependencies are missing, the Docker build will fail.
Building the Docker Image using GitHub Actions
The GitHub Actions workflow Build and Publish Docker Image builds the Docker images and uploads them to GitHub Packages at https://github.com/orgs/tenstorrent/packages?repo_name=tt-mlir. We use the git SHA we build from as the tag.
Building the Docker Image Locally
To test the changes and build the image locally, use the following command:
docker build -f .github/Dockerfile.base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ci-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird -build-args FROM_IMAGE=base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird -build-args FROM_IMAGE=ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-ubuntu-22-04:latest .
Using the Image in GitHub Actions Jobs
The GitHub Actions workflow Build in Docker uses a Docker container for building:
container:
image: ghcr.io/${{ github.repository }}/tt-mlir-ci-ubuntu-22-04:latest
options: --user root
Tools
Currently, there are a few primary tools that are part of the ttmlir
project:
ttmlir-opt
: Thettmlir
optimizer driver. This tool is used to run thettmlir
compiler passes on a.mlir
source files and is central to developing and testing the compiler.ttrt
: This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files.tt-explorer
: Visualizer tool forttmlir
-powered compiler results. Visualizes from emitted.mlir
files to display compiled model, attributes, performance results, and provide a platform for human-driven overrides to gameify model tuning.
ttmlir-opt
The ttmlir
optimizer driver. This tool is used to run the ttmlir
compiler passes on a .mlir
source files and is central to developing and testing the compiler.
Simple Test
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir
# Or
./build/bin/ttmlir-opt --ttir-to-ttmetal-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir
ttmlir-translate
The ttmlir-translate
translation utility. Unlike ttmlir-opt
tool which is used to run passes within the MLIR world, ttmlir-translate
allows us to ingest something (e.g. code) into MLIR world, and also produce something (e.g. executable binary, or even code again) from MLIR.
Generate C++ code from MLIR
# First, let's run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o c.mlir
# Now run `ttmlir-translate` to produce C++ code
./build/bin/ttmlir-translate -mlir-to-cpp c.mlir -allow-unregistered-dialect
Bonus: These two commands can be piped, to avoid writing a mlir
file to disk, like so:
./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir | ./build/bin/ttmlir-translate -mlir-to-cpp -allow-unregistered-dialect
Generate flatbuffer file from MLIR
# First run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o ttnn.mlir
# Now run `ttmlir-translate` to produce flatbuffer file
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
ttrt
This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files. It enables the running of flatbuffer files without a front-end runtime.
Building
source env/activate
cmake --build build -- ttrt
ttrt --help
Building runtime mode
Add the following flags when building the compiler
-DTTMLIR_ENABLE_RUNTIME=ON
Building perf mode
Add the following flags when building the compiler
-DTTMLIR_ENABLE_RUNTIME=ON
-DTT_RUNTIME_ENABLE_PERF_TRACE=ON
LOGGER Levels
ttrt support logging at different logger levels. You will need to set env var TTRT_LOGGER_LEVEL
. By default, it will print all log messages.
TTRT_LOGGER_LEVEL=INFO
TTRT_LOGGER_LEVEL=CRITICAL
TTRT_LOGGER_LEVEL=ERROR
TTRT_LOGGER_LEVEL=WARNING
TTRT_LOGGER_LEVEL=DEBUG
Installing ttrt as python whls
Everytime you build ttrt, it will create a whls file in build/runtime/tools/python/build
. Ex filename ttrt-0.0.235-cp310-cp310-linux_x86_64.whl
. You can take this whls file and install it in any docker container and in any venv outside of ttmlir. After which, you can use all the following functionality as the same.
- Download whls
- Create a python venv
python -m venv ttrt_env
source ttrt_env/bin/activate
- Install whls (replace with your version of the whls)
pip install ttrt-0.0.235-cp310-cp310-linux_x86_64.whl
Generate a flatbuffer file from compiler
The compiler supports a pass to load a system descriptor to compile against. You can feed this pass into ttmlir-opt.
- Build ttmlir
- Build ttrt (see building section on this page)
- Generate ttsys file from the system you want to compile for using ttrt. This will create a
system_desc.ttsys
file underttrt-artifacts
folder.
ttrt query --save-artifacts
- Use ttmlir-opt tool in compiler to feed system descriptor. See the ttmlir-opt documentation for more information on how to generate .mlir files.
./build/bin/ttmlir-opt --ttir-load-system-desc="path=/path/to/system_desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
or (pipe path directly into ttir-to-ttnn-backend-pipeline)
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=/path/to/system_desc.ttsys" test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
- Use ttmlir-translate tool in compiler to generate the flatbuffer executable. See the ttmlir-translate documentation for more information on how to generate flatbuffer files.
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
- Run your test cases using ttrt
ttrt run /path/to/out.ttnn
Generate flatbuffer files using llvm-lit
There are already existing .mlir test cases under test/ttmlir/Silicon
. You can use llvm-lit tool to generate the corresponding ttnn and ttm files.
- Build ttmlir
- Build ttrt (see building section on this page)
- Generate ttsys file from the system you want to compile for using ttrt. This will create a
system_desc.ttsys
file underttrt-artifacts
folder.
ttrt query --save-artifacts
- Export this file in your environment using
export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys
. When llvm-lit is run, it will query this variable and generate the ttnn and ttm files using this system. Optionally, you can also provide this manually when running llvm-lit. - Generate your test cases. This will generate all your ttnn and ttm files under
build/test/ttmlir/Silicon
. ttnn files have a.ttnn
file extension and ttmetal files have a.ttm
extension.
cmake --build build -- check-ttmlir
- (Optional) If you have a single .mlir file (or a directory of custom .mlir files) that you created using the compiler, and you want to generate the corresponding ttnn and ttm files for it, you can run llvm-lit standalone to the path of your .mlir file or directory of .mlir files to generate the flatbuffer executables. You will have to make sure you add in the correct llvm-lit configs into your .mlir file. See section on adding llvm-lit config options inside a .mlir file to create flatbuffer binaries for more info. You must also make sure your .mlir test is found within test/ttmlir/Silicon folder (and point lit to the build folder)!
llvm-lit -v ./build/test/ttmlir/Silicon
or
SYSTEM_DESC_PATH=/path/to/system_desc.ttsys llvm-lit -v ./build/test/ttmlir/Silicon
- Run your test cases using ttrt
ttrt run /path/to/test.ttnn
ttrt run /path/to/dir/of/flatbuffers
Adding llvm-lit config options inside a .mlir file to create flatbuffer binaries
Inside of your .mlir file, you can add certain config options that llvm-lit will use when running against that test case. For the purpose of generating flatbuffer executables, you can add --ttir-load-system-desc="path=%system_desc_path%"
which will tell llvm-lit to parse the system desc found from the environment flag set by export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys
. You can also paste a custom path to a system desc file as well.
// RUN: ttmlir-opt --ttir-load-system-desc="path=%system_desc_path%" --ttir-implicit-device --ttir-layout --convert-ttir-to-ttnn %s > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn
Adding new mlir test cases
You can copy your .mlir test file (with the appropriate llvm-lit config options for generating flatbuffer binaries) into test/ttmlir/Silicon
. Then, follow generating flatbuffer files using llvm-lit to generate the executables to run!
Versioning
ttrt and flatbuffers have strict versioning check. When running a flatbuffer against ttrt, you have to make sure the flatbuffer was generated using the same version as ttrt (or vice versa). Major and Minor versions are manually set using github tags when releases are made. Patch versioning is the number of commits from the last major/minor tag.
vmajor.minor.patch
APIs
ttrt --help
ttrt read
ttrt run
ttrt query
ttrt perf
ttrt check
Command Line
There are different ways you can use the APIs under ttrt. The first is via the command line as follows. All artifacts are saved under ttrt-artifacts
folder under TT_MLIR_HOME
environment variable. By default, all logging is printed to the terminal. You can specify a log file to dump output to.
read
Read sections of a binary file
ttrt read --help
ttrt read --section mlir out.ttnn
ttrt read --section cpp out.ttnn
ttrt read --section version out.ttnn
ttrt read --section system_desc out.ttnn
ttrt read --section inputs out.ttnn
ttrt read --section outputs out.ttnn
ttrt read --section all out.ttnn
ttrt read --section all out.ttnn --clean-artifacts
ttrt read --section all out.ttnn --save-artifacts
ttrt read --section all /dir/of/flatbuffers
ttrt read system_desc.ttsys
ttrt read --section system_desc system_desc.ttsys
ttrt read system_desc.ttsys --log-file ttrt.log
ttrt read out.ttnn --save-artifacts --artifact-dir /path/to/some/dir
ttrt read out.ttnn --result-file result.json
run
Run a binary file or a directory of binary files
Note: It's required to be on a system with silicon and to have a runtime enabled build -DTTMLIR_ENABLE_RUNTIME=ON
.
ttrt run --help
ttrt run out.ttnn
ttrt run out.ttnn --seed 0
ttrt run out.ttnn --init arange
ttrt run out.ttnn --identity
ttrt run out.ttnn --identity --rtol 1 --atol 1
ttrt run out.ttnn --clean-artifacts
ttrt run out.ttnn --save-artifacts
ttrt run out.ttnn --loops 10
ttrt run --program-index all out.ttnn
ttrt run --program-index 0 out.ttnn
ttrt run /dir/of/flatbuffers
ttrt run /dir/of/flatbuffers --loops 10
ttrt run /dir/of/flatbuffers --log-file ttrt.log
ttrt run out.ttnn --save-artifacts --artifact-dir /path/to/some/dir
ttrt run out.ttnn --load-kernels-from-disk
ttrt run out.ttnn --enable-async-ttnn
ttrt run out.ttnn --result-file result.json
query
Query the system to obtain the system desc file (optionally store it to disk)
Note: It's required to be on a system with silicon and to have a runtime enabled build -DTTMLIR_ENABLE_RUNTIME=ON
.
ttrt query --help
ttrt query
ttrt query --quiet
ttrt query --save-artifacts
ttrt query --clean-artifacts
ttrt query --save-artifacts --log-file ttrt.log
ttrt query --save-artifacts --artifact-dir /path/to/some/dir
ttrt query --result-file result.json
perf
Run performance mode of a binary file or a directory of binary files
Note: It's required to be on a system with silicon and to have a runtime enabled build -DTTMLIR_ENABLE_RUNTIME=ON
. Also need perf enabled build -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
.
Note: You can collect host only related performance data via --host-only
flag. By default, host and device side performance data are both collected.
If the saving artifacts flag is provided, perf mode will dump the following files in the artifacts directory
ops_perf_results.csv : compiled op performance results
profile_log_device.csv : dump of all device side profiled results
tracy_ops_data.csv : op data results dumped in a readable format
tracy_ops_times.csv : op time results dumped in a readable format
tracy_profile_log_host.tracy : tracy profiled results file, this file can be fed into the tracy GUI
ttrt perf --help
ttrt perf out.ttnn
ttrt perf out.ttnn --clean-artifacts
ttrt perf out.ttnn --save-artifacts
ttrt perf out.ttnn --loops 10
ttrt perf --program-index all out.ttnn
ttrt perf --program-index 0 out.ttnn
ttrt perf --host-only out.ttnn
ttrt perf /dir/of/flatbuffers --host-only
ttrt perf /dir/of/flatbuffers --loops 10 --host-only
ttrt perf /dir/of/flatbuffers --log-file ttrt.log --host-only
ttrt perf --save-artifacts --artifact-dir /path/to/some/dir
ttrt perf out.ttnn --result-file result.json
To use the Tracy GUI, run the following instructions on your macbook. You can upload your .tracy file into the GUI to view the profiled dumps.
git clone https://github.com/tenstorrent-metal/tracy.git
cd tracy/profiler/build/unix
make all
./Tracy-release
check
Check a binary file or a directory of binary files against a system desc (by default, uses the host machine)
Note: It's required to be on a system with silicon and to have a runtime enabled build -DTTMLIR_ENABLE_RUNTIME=ON
.
ttrt check --help
ttrt check out.ttnn
ttrt check out.ttnn --system-desc /path/to/system_desc.ttsys
ttrt check out.ttnn --clean-artifacts
ttrt check out.ttnn --save-artifacts
ttrt check out.ttnn --log-file ttrt.log
ttrt check /dir/of/flatbuffers --system-desc /dir/of/system_desc
ttrt check --save-artifacts --artifact-dir /path/to/some/dir out.ttnn
ttrt check out.ttnn --result-file result.json
ttrt as a python package
The other way to use the APIs under ttrt is importing it as a library. This allows the user to use it in custom scripts.
Import ttrt as a python package
from ttrt.common.api import API
Setup API and register all features
API.initialize_apis()
Setup arguments
You can specify certain arguments to pass to each API, or use the default arguments provided
args
This can be a dictionary of values to set inside your API instance. These are the same options as found via the command line. You can get the total list of support arguments via ttrt help command line. Any argument not provided will be set to the default.
custom_args = {}
custom_args["--clean-artifacts"] = True
query_instance = API.Query(args=custom_args)
logging
You can specify a specific logging module you want to set inside your API instance. The rationale behind this is to support different instances of different APIs, all being able to be logged to a different file.
from ttrt.common.util import Logger
log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)
read_instance = API.Read(logger=custom_logger)
artifacts
You can specify a specific artifacts directory to store all the generate metadata during the execution of any API run. This allows you to specify different artifact directories if you wish for different instances of APIs.
from ttrt.common.util import Artifacts
log_file_name = "some_file_name.log"
artifacts_folder_path = "/opt/folder"
custom_logger = Logger(log_file_name)
custom_artifacts = Artifacts(logger=custom_logger, artifacts_folder_path=artifacts_folder_path)
run_instance = API.Run(artifacts=custom_artifacts)
Execute API
Once all the arguments are setup, you can run your API instance with all your provided arguments. Note, APIs are stateless. Thus, subsequent calls to the same API instance will not preserve previous call artifacts. You can generate a new artifacts directory for subsequent runs if you wish to call the APIs multiple times, for example.
result_code, results = query_instance()
result_code, results = read_instance()
result_code, results = run_instance()
Putting it all together
You can do interesting stuff when combining all the above features into your python script
from ttrt.common.api import API
from ttrt.common.util import Logger
from ttrt.common.util import Artifacts
API.initialize_apis()
custom_args = {}
custom_args["--clean-artifacts"] = True
custom_args["--save-artifacts"] = True
custom_args["--loops"] = 10
custom_args["--init"] = "randn"
custom_args["binary"] = "/path/to/subtract.ttnn"
log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)
artifacts_folder_path = "/opt/folder"
custom_artifacts = Artifacts(logger=custom_logger, artifacts_folder_path=artifacts_folder_path)
run_instance = API.Run(args=custom_args, logger=custom_logger, artifacts=custom_artifacts)
result_code, results = run_instance()
FAQ
Flatbuffer version does not match ttrt version!
- ttrt and flatbuffer have strict versioning that is checked during ttrt execution. You will have to generate a flatbuffer using the same version of ttrt (or vice versa). This mean you might have to build on the same branch on which the flatbuffer was generated or regenerate the flatbuffer using your current build.
System desc does not match flatbuffer!
- flatbuffers are compiled using a specific system desc (or default values if no system desc is provided). During runtime, the flatbuffer system desc is checked against the current system to ensure the system being run on supports the flatbuffer that was compiled. If you get this error, you will have to regenerate the flatbuffer using the system you want to run on. See generate a flatbuffer file from compiler section on how to do this.
I just want to test and push my commit! What do I do!
- follow these steps (on both n150 and n300)
1. Build ttmlir (sample instructions - subject to change)
source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTTMLIR_ENABLE_RUNTIME=ON -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
cmake --build build
2. Build ttrt (sample instructions - subject to change)
cmake --build build -- ttrt
3. Query system
ttrt query --save-artifacts
4. Export system desc file
export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys (path dumped in previous command)
5. Generate test cases
cmake --build build -- check-ttmlir
6. Run test cases
ttrt run build/test/ttmlir/Silicon
7. (Optional) Run perf test cases
ttrt perf build/test/ttmlir/Silicon
tt-explorer
Welcome to the tt-explorer wiki! The Wiki will serve as a source for documentation, examples, and general knowledge related to the TT-MLIR visualization project. The sidebar will provide navigation to relevant pages. If this is your first time hearing about the project, take a look at Project Architecture for an in-depth introduction to the tool and motivations behind it :)
Quick Start
TT-Explorer is made to be as painless as possible, as such the installation on top of the pre-existing tt-mlir
project is as minimal as possible.
- Build
tt-mlir
- Run
source env/activate
to be intt-mlir
virtualenv for the following steps - Install
tt-adapter
usingpip install -e .
in tt-adapter root directory. - Install
tt-explorer
usingpip install -e .
in tt-explorer root directory - Run
tt-explorer
in terminal to start tt-explorer instance. (Refer to CLI section in API for specifics) - Ensure server has started in
tt-explorer
shell instance (check for message below)
Starting Model Explorer server at:
http://localhost:8080
Visualizer tool for ttmlir
-powered compiler results. Visualizes from emitted .mlir
files to display compiled model, attributes, performance results, and provide a platform for human-driven overrides to gameify model tuning.
TT-Explorer - Project Architecture
TT-Explorer is a tool made to ease the pain of tuning a model and developing on Tenstorrent hardware. It provides a “Human-In-Loop” interface such that the compiler results can be actively tuned and understood by the person compiling the model. To complete this goal, the tool has to be designed such that users of any level of experience are all able to glean useful information from the visualization of the model, and be able to explore what the model does.
Software Architecture
The software will be built around the TT-Forge compiler to provide most of the functionality. Model Explorer will be used for the visualization functionality and as the main platform upon which TT-Explorer is built on.
Since Model-Explorer is built using Python, the majority of TT-Explorer will be structured in Python, with frequent use of the bindings to C++ provided by TT-MLIR.
The following components will be put together:
TT-Forge-FE (Front End)
TT-Forge FE is currently the primary frontend which uses TVM to transform conventional AI models into the MLIR in the TTIR Dialect.
Ingests: AI Model defined in PyTorch, TF, etc… Emits: Rudimentary TTIR Module consisting of Ops from AI Model.
TT-MLIR
TT-MLIR currently defines the out-of-tree MLIR compiler created by Tenstorrent to specifically target TT Hardware as a backend. It comprises a platform of several dialects (TTIR, TTNN, TTMetal) and the passes and transformations to compile a model into an executable that can run on TT hardware. In the scope of TT-Explorer the python bindings will be leveraged.
Ingests: TTIR Module, Overrides JSON Emits: Python Bindings to interface with TTIR Module, Overridden TTIR Modules, Flatbuffers
TT-Adapter
Model Explorer provides an extension interface where custom adapters can be implemented to visualize from different formats. TT-Adapter is the adapter created for TT-Explorer that parses TTIR Modules using the Python Bindings provided by TT-MLIR to create a graph legible by model-explorer. It also has an extensible REST endpoint that is leveraged to implement functionality, this endpoint acts as the main bridge between the Client and Host side processes.
Ingests: TTIR Modules, TT-MLIR Python Bindings, REST API Calls Emits: Model-Explorer Graph, REST API Results
TTRT
TT-RT is the runtime library for TT-Forge, which provides an API to run Flatbuffers generated from TT-MLIR. These flatbuffers contain the compiled results of the TTIR module, and TTRT allows us to query and execute them. Particularly, a performance trace can be generated using Tracy, which is fed into model-explorer to visualize the performance of operations in the graph.
Ingests: Flatbuffers Emits: Performance Trace, Model Results
Model-Explorer
Model Explorer is the backbone of the client and visualization of these models. It is deceptively placed in the “Client” portion of the diagram, but realistically TT-Explorer will be run on the host, and so will the model-explorer instance. The frontend will be a client of the REST API created by TT-Adapter and will use URLs from the model-explorer server to visualize the models.
Ingests: Model Explorer Graph, User-Provided Overrides (UI), Performance Trace Emits: Overrides JSON, Model Visualization
These components all work together to provide the TT-Explorer platform.
Client-Host Design Paradigm
Since performance traces and execution rely on Silicon machines, there is a push to decouple the execution and MLIR-environment heavy aspects of TT-Explorer onto some host device and have a lightweight client API that uses the REST endpoint provided by TT-Adapter to leverage the host device without having to constantly be on said host. This is very useful for cloud development (as is common Tenstorrent). In doing so, TT-Explorer is a project that can be spun up in either a tt-mlir
environment, or without one. The lightweight python version of TT-Explorer provides a set of utilities that call upon and visualize models from the host, the host will create the server and serve the API to be consumed.
TT-Explorer
The following is a listed reference for the API in using TT-Explorer, check the TT-Adapter API reference below.
TTExplorer
Overview
The TTExplorer
class is responsible for interacting with the model_explorer server, including uploading models, initializing settings, and executing models.
Initialization
__init__(self, port=8080, url="http://localhost", server=False, config=None)
Initializes the TTExplorer instance.
- Parameters:
port (int)
: The port number for the model_explorer server. Default is 8080.url (str)
: The base URL for the model_explorer server. Default is"http://localhost"
.server (bool)
: Flag to indicate if the server should be created. If this is set to true, ensure an environment where thettrt
andttmlir
python bindings is used. Default is False.config (dict)
: Configuration for the model_explorer server. Default is None.
Methods
get_model_path(self, file) -> str
Uploads a model file to the model_explorer server and returns the temporary path provided by the server.
- Parameters:
file (file-like object)
: The model file to be uploaded.
- Returns:
str
: The temporary path of the uploaded model file.
initialize(self, settings={})
Initializes the server-side TT-Explorer
by assigning a System Descriptor for future operations, needed to execute models.
- Parameters:
settings (dict)
: Settings for initialization, currently none. Default is an empty dictionary.
- Returns:
dict
: dict withsystem_desc_path
key pointing to server-path to System Descriptor
execute_model(self, model_path: str, settings={})
Executes a model on the model_explorer server with the provided settings.
- Parameters:
model_path (str)
: Server path tottir
module to be executed, ensure that module has been uploaded first.settings (dict)
: Settings for execution. Default is an empty dictionary."ttir_to_ttnn_options": List[str]
Pipeline options to be fed intottir-to-ttnn-backend-pipeline
's String Parser"artifact_dir": str(Path)
A valid Server-Path to store artifacts from execution, if this flag is set then artifacts are not automatically deleted after execution is complete.
- Returns:
dict
: Relevant emitted files from Execution"log_file": str
: Log-File fromttrt perf
call"stdout": str
: STDOUT fromttrt perf
call, utf-8 decoded."perf_trace": str
: CSV Performance Trace from module run.
Example Usage
# Initialize TTExplorer
explorer = TTExplorer(server=True)
# Explorer instance now running on thread on http://localhost:8080
# Make sure you wait until the thread has started the Flask server, you can check by viewing STDOUT.
# Upload a model file
file = open('my-module.mlir', 'r')
model_path = explorer.get_model_path(file)
# Since local==server, the model_path is moved to a tempfile on the same machine
# Initialize the SystemDesc on Machine for execution purposes
explorer.initialize()
# Execute the model, store artifacts permanently in home directory.
resp = explorer.execute_model(model_path, settings={'artifact_dir': '/home/<user>/ttrt-artifacts'})
csv = resp['perf_trace'] # Do with the CSV trace as you would like to view the performance results!
TT-Adapter
The following is a reference for the "REST" API provided by TT-Adapter. First, a short info-dump on how an extensible API can be built on top of Model Explorer.
Building an API using Model Explorer
The /apipost/v1/send_command
endpoint provides an extensible platform with which commands are sent to be executed directly by the adapter specified. This becomes the main endpoint through which communication is facilitated between the server and client, the commands respond with an "adapter response".
Sending Commands
The body of the command must be JSON, and only the following fields are fed into the adapter functions:
cmd = {
"extensionId": "tt_adapter", // tt_adapter to invoke functions from TT-Adapter
"cmdId": "<name of function>", // Name of function to be run, "convert" is built into all adapters to convert some model to graph
"modelPath": "<local_path to file>", // Path to model on server to be fed into function
"deleteAfterConversion": False, // True if file at modelPath is to be deleted after function run
"settings": {...}, // Object holding custom settings to be fed into function
}
More often than not, functions do not need all of these fields, but they must all be present to properly process the command sent into the function. Speaking of function, the function signature that all commands have to follow is as such:
class TTAdapter(Adapter):
...
def my_adapter_fn(self, model_path: str, settings: dict):
pass # Parse model_path and settings objects as they are fed from send_command endpoint.
This function is invoked and called from a new instance every time. This is important to understand for the idea of persisting information on the server. The onus is often on the end-user to store and preserve important information such as the path of a model they've uploaded, or the paths of important artifacts that the server has produced. TTExplorer
aims to make this as easy as possible.
Information can be processed in this function however the user would like to define, and often settings becomes a versatile endpoint to provide more information and context for the execution of some function. As an example, refer to TTAdapter:initialize
, this function to load a SystemDesc into the environment has little to do with modelPath
or deleteAfterConversion
, as such these variables are not processed at all, and the function only executes a static initialization process regardless of the parameters passed into the command.
Adapter Response
Model Explorer was probably not made to allow for such an extensible framework to be tacked onto it. As such, the adapter response is processed in a very particular way before it is sent back to the user. In particular, refer to model_explorer.utils.convert_adapter_response
which is run on the output of every function. This means that responses can only be in JSON format and are constructed as:
{
"graphs": [
{/* response goes here */},
]
}
for custom adapter responses. This limits the transfer of raw bytes data through different MIME Types, and requires the tt_adapter.utils.to_adapter_format
which turns any dict
object into a model explorer adapter compatible response. While this framework works well for graphs, it makes an "extensible" API difficult to implement.
Current API Reference:
Initialize
Called from TTExplorer.initialize
, used to Load SystemDesc into environment.
cmd = {
"extensionId": "tt_adapter",
"cmdId": "initialize",
"modelPath": "", // Irrelevant
"deleteAfterConversion": False,
"settings": {}, // None at the moment
}
// RESPONSE
{"system_desc_path": "<path to system_desc.ttsys>"}
Execute
Called from TTExplorer.execute_model
, executes a model.
cmd = {
"extensionId": "tt_adapter",
"cmdId": "execute",
"modelPath": "<server-path-to-model>",
"deleteAfterConversion": False, // Can be set to True if TTIR module is to be deleted after run
"settings": {
"ttir_to_ttnn_options": List[str], // Pipeline Options to feed into ttir_to_ttnn_backend_pipeline
"artifact_dir": str, // Path on server to store TTRT Artifacts to, artifacts are not deleted after perf if set.
},
}
// RESPONSE
{
"stdout": "<raw text output to STDOUT from TTRT Perf Run>",
"log_file": "<raw logging output>",
"perf_trace": "<raw CSV perf trace collected from TTRT Perf Run, Not present if TTRT Perf failed>",
}
Convert
Standard built-in conversion function, converts TTIR Module into Model Explorer Graph. Also provides settings
as a platform for overrides to be applied to the graph.
cmd = {
"extensionId": "tt_adapter",
"cmdId": "convert",
"modelPath": "<server-path-to-ttir-module>",
"deleteAfterConversion": True/False,
"settings": {/* Overrides */}, // Undefined at the moment
}
// RESPONSE
<model-explorer-graph JSON Object>
Milestone 1 (v0.1)
Main Goal - Visualize & Execute
This will highlight half of the essential work that this tool should be able to do in both visualizing a model and executing it using the current TT-Forge stack. The frontend transformation of a model -> TTIR will be done outside of the scope of TT-Explorer at the moment. For this milestone TT-Explorer will be able to spin up a host-side and a client-side instance. The tool will be able to ingest TTIR modules to produce a visual result, and be able to execute this module. Ambitiously, the performance traces should be collected back into TT-Explorer to be displayed.
Tasks:
-
Load TTIR Modules and Visualize TTIR-Ops in Model Explorer -
Create Extensible Notebook UX allowing for visualization and scripting capabilities -
Add functionality to Model Explorer to load from re-compiled TTIR Modules (might be from JSON) -
Add functionality to TT-MLIR to execute from Python Bindings -
Create REST API skeleton in TT-Adapter -
From REST API Call, Invoke python bindings to execute TTIR module using TT-Adapter - (If possible) Parse Perf Trace Artifact and visualize performance in Model-Explorer (as Node Data)
Milestone 2 (v0.2)
Main Goal - Model Editor
The primary function of TT-Explorer is to visualize and edit the model according to what the user defines as overrides the automatically generated compiler results. This milestone highlights that functionality in TT-Explorer, focusing around providing UI, TT-MLIR, and TT-Explorer features that enable the user to edit and tune a model “in-loop” with the TT-Forge compiler.
Tasks:
- Flesh out and test locations ID such that operations can be tracked through the compiler stack.
- Use Loc IDs to bind TTIR Ops with Tracy Perf Trace Artifact, and send to Model-Explorer to visualize.
- Implement Overrides Functionality into TT-MLIR, tracking based on Loc IDs.
- Overhaul UI to enable editing node attributes, use these updated fields to send information back to TT-Explorer via REST API (in the form of an Overrides JSON)
- Parse Overrides JSON and apply Overrides over a REST API Call, visualize re-compiled graph now.
- Provide REST API endpoint to track “legal” configurations and provide “legal” options attached to Graph JSON.
Milestone 3 (v0.3+)
Main Goal - Matured Tool and Extensibility
The focus of this milestone is to transition TT-Explorer from a prototype tool into a mature visualization and editing tool for “Human-In-Loop” compilation. The tool is now planned to made extensible for other dialects and entry points forecast into TT-MLIR (Jax, StableHLO, etc…) and development of the visualization components of the tool provide feedback to upstream repos like model-explorer
. Here the focus is on providing extensible interfaces for new UI elements (in supporting multi-chip and beyond), REST API, and Overrides.
Tasks:
- Begin researching autogenerated Python bindings for pipelines and transformations defined in C++.
- Create modular frontend capabilities out of Flask app in Model-Explorer
- Create a “mono-adapter” which holds the paths to invoke dialect-specific adapters for each dialect to be supported by TT-Explorer
-
Begin adding new dialects like
.ttm
,.ttnn
to Model Explorer so that complied results can be inspected and analyzed to optimize at different steps of the compiler. - To be defined later, depending on the growth of the MLIR Project
Flatbuffers
Flatbuffers are the binary serialization format used by TTMLIR and they currently come in a few flavors (designated by the file extension):
.ttsys
: A system description file that is the mechanism for supplying target information to the compiler. These can be collected on a target machine and downloaded to a development machine to enable cross-compilation..ttnn
: A compiled binary file intended to be loaded and executed by the TTNN backend runtime..ttb
: A compiled binary file intended to be loaded and executed by the TTMetal backend runtime (Unsupported).
ci
Our CI infrastructure is currently hosted on cloud. Cloud machines are used and linked as GitHub runners.
Key Words
Target Silicon (coming soon)
- 1:1 mapping to unique system-desc (this is because an n150 card can have different harvested rows)
Target Family
- product type (n150, n300)
Target Capabilities (coming soon)
- describes testable traits of Target Family
n150: {
test params to use if running on n150
}
n300: {
test params to use if running on n150
}
Test Capabilities (coming soon)
- set of target capabilities defined in the test
- test will populate certain parameters depending on the Target Family/Target Silicon it is running on
GitHub Runner CI Tags
Runner Use
There are 2 types of runner machines. Builders build offline and runners are silicon machines.
- builder
- runner
Runner Type
There are 2 runner types. Bare metals are standalone and virtual machines are kubernetes pods.
- bare-metal
- virtual-machine
Architecture
Supported architectures
- wormhole_b0
- blackhole (coming soon)
Pipeline Type
Supported pipelines
- perf
- functional
Active
Defines whether a runner is in service or taken out of service for maintenance
- in-service
- out-of-service
Target Family
Supported configurations of machines
- n150
- n300
- t3000 (coming soon)
- tg (coming soon)
- tgg (coming soon)
Target Silicon (coming soon)
-silicon-n150-0 (0th row harvested)
-silicon-n150-1 (1th row harvested)
-silicon-n300-0-0 (0th row harvested both chips)
Pipeline durations
- push: every push to main
- pr: every PR
CI Test Flow
1. GitHub runner
- build tt-mlir
- build ttrt
- upload artifacts
2. Silicon runner
- download tt-mlir / ttrt artifacts
- ttrt generate system desc
- llvm-lit runs all unit test, including silicon ones to generate flatbuffers (will only generate ones that are supported for that test file)
- ttrt runs generated flatbuffers
Adding a test
When adding a test, you can specify when the test should run and what values it should inherit. The test defines how it should run, not the infrastructure. The infrastructure will execute what the test defines. For now, if you specify nothing, it will run on all default parameters. Note: if you provide a target family, then it will be default run on any target silicon machine. If you need a specific target silicon machine (eg one with 1st row harvested), specify it in Target Silicon. Note: if you specify perf pipeline, it will automatically run on a bare metal machine Default parameters
[Architecture]: [wormhole_b0]
[Pipeline]: [functional, perf]
[Target Family]: [n150, n300]
[Target Silicon]: []
[Duration]: [push]
Location: test/ttmlir/Silicon
File Type: .mlir
REQUIRES: [Architecture] [Pipeline] [Target Family] [Target Silicon] [Duration] (coming soon)
UNSUPPORTED: [Target Family] [Target Silicon] (coming soon)
Additional Reading
This section contains pointers to reading material that may be useful for understanding the project.
MLIR
- https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index.html
- https://mlir.llvm.org/docs/Tutorials/Toy/
- https://www.jeremykun.com/2023/08/10/mlir-getting-started/
- https://arxiv.org/pdf/2002.11054
- https://ieeexplore.ieee.org/abstract/document/9370308
Dialects
- affine dialect
- Affine map is a really powerful primitive that can be used to describe most data movement patterns.
- It can also be used to describe memory layouts.
- linalg dialect
- tosa dialect
- tosa spec
- memref dialect
- torch-mlir
- onnx-mlir
- triton-mlir
Tablegen
LLVM Testing Framework Tools
Jax
Flatbuffer
Openxla Website
openxla
StableHLO
Contributor Covenant Code of Conduct
Our Pledge
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
Our Standards
Examples of behavior that contributes to a positive environment for our community include:
- Demonstrating empathy and kindness toward other people
- Being respectful of differing opinions, viewpoints, and experiences
- Giving and gracefully accepting constructive feedback
- Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
- Focusing on what is best not just for us as individuals, but for the overall community
Examples of unacceptable behavior include:
- The use of sexualized language or imagery, and sexual attention or advances of any kind
- Trolling, insulting or derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or email address, without their explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting
Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
Scope
This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at nsmith@tenstorrent.com or staylor@tenstorrent.com. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the reporter of any incident.
Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
1. Correction
Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
2. Warning
Community Impact: A violation through a single incident or series of actions.
Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
3. Temporary Ban
Community Impact: A serious violation of community standards, including sustained inappropriate behavior.
Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
4. Permanent Ban
Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
Consequence: A permanent ban from any sort of public interaction within the community.
Attribution
This Code of Conduct is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
Community Impact Guidelines were inspired by Mozilla's code of conduct enforcement ladder.
For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.
Project Structure
env
: Contains the environment setup for building project dependencies, such as LLVM and Flatbuffersinclude/ttmlir
: Public headers for the TTMLIR libraryDialect
: MLIR dialect interfaces and definitions, dialects typically follow a common directory tree structure:IR
: MLIR operation/type/attribute interfaces and definitionsPasses.[h|td]
: MLIR pass interfaces and definitionsTransforms
: Common MLIR transformations, typically invoked by passes
Target
: Flatbuffer schema definitions. This defines the binary interface between the compiler and the runtime
lib
: TTMLIR library implementationCAPI
: C API for interfacing with the TTMLIR library, note this is needed for implementing the python bindings. Read more about it here: https://mlir.llvm.org/docs/Bindings/Python/#use-the-c-apiDialect
: MLIR dialect implementations
runtime
: Device runtime implementationinclude/tt/runtime
: Public headers for the runtime interfacelib
: Runtime implementationtools/python
: Python bindings for the runtime, currently this is wherettrt
is implemented
test
: Test suitetools/ttmlir-opt
: TTMLIR optimizer driver
Namespaces
mlir
: On the compiler side, we use the MLIR namespace for all MLIR types and operations and subnamespace for our dialects.mlir::tt
: Everything ttmlir related is underneath this namespace. Since we need to subnamespace undermlir
, justmlir::tt
seemed better thanmlir::ttmlir
which feels redundant.mlir::tt::ttir
: The TTIR dialect namespacemlir::tt::ttnn
: The TTNN dialect namespacemlir::tt::ttmetal
: The TTMetal dialect namespacemlir::tt::ttkernel
: The TTKernel dialect namespace
tt::runtime
: On the runtime side, we use thett::runtime
namespace for all runtime types and operations.tt::runtime::ttnn
: The TTNN runtime namespacett::runtime::ttmetal
: The TTMetal runtime namespace (not implemented)
Dialects Overview
Here is a brief overview of the dialects in the project, please refer to the individual dialect documentation for more details.:
tt
: Common types such as,tt.tile
,tt.layout
,tt.grid
, etc. and enums such as, data formats, memory spaces, iterator types etc.ttir
: A high level dialect that models the tensor compute graph on tenstorrent devices. Acceptstosa
andlinalg
input.ttir.generic
: Generically describe compute work.ttir.to_layout
: Convert between different tensor memory layouts and transfer between different memory spaces.tensor.pad
: Pad a tensor with a value (ie. convs)ttir.yield
: return result memref of computation in dispatch region body, lowers tottkernel.yield
ttir.kernel
: lowers to some backend kernel
ttnn
: A TTNN dialect that models ttnn API.ttkernel
: Tenstorrent kernel library operations.ttkernel.noc_async_read
ttkernel.noc_async_write
ttkernel.cb_push_back
ttkernel.[matmul|add|multiply]
: Computations on tiles in source register space, store the result in dest register space.ttkernel.sfpu_*
: Computations on tiles in dest register space using sfpu coprocessor.
ttmetal
: Operations that dispatch work from host to device.ttmetal.dispatch
: Dispatch a grid of compute work.
Adding an Op
This guide will walk you through the process of adding a new Op end to end in
tt-mlir, in this case we will be adding a matmul
operation. Note that the matmul
op was added as part of the same changeset as this guide, it could be useful to
reference the diff alongside this guide to see the changes in full.
This guide will cover the following steps:
- Adding an Op
- 1. Define the Op in the TTIR frontend dialect
- 2. Define the Op in the TTNN backend dialect
- 3. Convert / Implement the Op in the TTNN passes
- 4. Add a compiler unit test for the Op
- 5. Define flatbuffer schema for the Op
- 6. Serialize the Op in the flatbuffer format
- 7. Add runtime support for the Op
- 8. Add a silicon unit test for the Op
1. Define the Op in the TTIR frontend dialect
We will start by defining the Op in the TTIR dialect. The TTIR Ops are defined
in a tablegen file located at include/ttmlir/Dialect/TTIR/IR/TTIROps.td
.
Tablegen is a domain-specific language for defining ops/types/attributes in MLIR and LLVM, these definitions constitute the dialect's Operation Definition Specification (ODS).
Here is an example of defining matmul
in the TTIR dialect:
def TTIR_MatmulOp : TTIR_DPSOp<"matmul"> {
let summary = "Matrix multiply operation.";
let description = [{
Matrix multiply operation.
}];
let arguments = (ins AnyRankedTensor:$a,
AnyRankedTensor:$b,
AnyRankedTensor:$output,
TT_OperandConstraintArrayAttr:$operand_constraints);
let results = (outs AnyRankedTensor:$result);
let extraClassDeclaration = [{
MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
}];
let hasVerifier = 1;
}
There are many things to break down here, starting from the top:
def
in tablegen is used to define a concrete type, this will have a 1-1 mapping to a C++ generated class, and for this particular case the build will end up generating filebuild/include/ttmlir/Dialect/TTIR/IR/TTIROps.h.inc
.- It inherits from
class TTIR_DPSOp
, classes in tablegen don't define a concrete type, but rather an interface that augment or constrain inheriteddef
s.TTIR_DPSOp
is a class that defines the common attributes for all TTIR Ops that implement Destination Passing Style (DPS) semantics. DPS just means that the result tensor is passed as an argument to the operation which will be critical for modeling buffer allocation / lifetimes. Note the 3rd argumentAnyRankedTensor:$output
. - Next we have a list of
arguments
. These arguments consist of a mixture ofType
s (i.e.AnyRankedTensor
) andAttribute
s (i.e.TT_OperandConstraintArrayAttr
). Read more about Types & Attributes here.AnyRankedTensor
is part of a tablegen standard library which type aliases to MLIR's builtin Tensor type, with the added constraint that the tensor has a static rank. As much as possible we want to use the builtin types and infrastructure provided by MLIR.TT_OperandConstraintArrayAttr
is a custom attribute that we have defined in theTT
dialect. This attribute is used to specify constraints on the operands of the operation. For example, theTTIR_MatmulOp
requires that the input tensors be in tile layout, this attribute captures this constraint.
- Next we have a list of
results
in this case just 1, which aliases theoutput
tensor. One drawback of DPS is that the result tensor and the output tensor will appear to have different SSA names in the IR, but they really alias the same object. This can make writing some passes more cumbersome. - Next we have
extraClassDeclaration
, which enables us to inject member functions, written directly in C++, into the generated class. We are doing this for this particular case in order to satisfy the DPS interface which requires an implementation for getting the mutated output tensor. - Finally, we have
hasVerifier = 1
, this tells MLIR that we have a verifier function that will be called to validate the operation. This is a good practice to ensure that the IR is well formed.
We can now try building and opening the TTIROps.h.inc
file to see the generated C++ code.
We will actually get a linker error because we have hasVerifier = 1
which
automatically declared a verifier function, but we need to go implement.
Let's head over to lib/Dialect/TTIR/IR/TTIROps.cpp
and implement the verifier.
// MatmulOp verification
::mlir::LogicalResult mlir::tt::ttir::MatmulOp::verify() {
::mlir::RankedTensorType inputAType = getA().getType();
::mlir::RankedTensorType inputBType = getB().getType();
::mlir::RankedTensorType outputType = getOutput().getType();
llvm::ArrayRef<int64_t> outputShape = outputType.getShape();
llvm::SmallVector<int64_t> inputAShape(inputAType.getShape());
llvm::SmallVector<int64_t> inputBShape(inputBType.getShape());
// Verify that the input A is at least 1D tensor
if (inputAType.getRank() < 1) {
return emitOpError("Input A must be at least a 1D tensor");
}
// Verify that the input B is at least 1D tensor
if (inputBType.getRank() < 1) {
return emitOpError("Input B must be at least a 1D tensor");
}
// If input A is a vector (1D tensor), 1 is prepended to its dimension for the
// purpose of the matrix multiply. After the matrix multiply, the prepended
// dimension is removed.
if (inputAType.getRank() == 1) {
inputAShape.insert(inputAShape.begin(), 1);
}
// If input B is a vector (1D tensor), a 1 is appended to its dimension for
// the purpose of the matrix-vector product and removed after.
if (inputBType.getRank() == 1) {
inputBShape.push_back(1);
}
// Verify that the input A and input B has matching inner dimensions
if (inputAShape[inputAShape.size() - 1] !=
inputBShape[inputBShape.size() - 2]) {
return emitOpError(
"Input A[-1](" + std::to_string(inputAShape[inputAShape.size() - 1]) +
") and B[-2](" + std::to_string(inputBShape[inputBShape.size() - 2]) +
") must have matching inner dimensions");
}
llvm::SmallVector<int64_t> expectedOutputShape;
// Verify that the batch dimensions are broadcast compatible and construct the
// expected output shape
if (inputAShape.size() > 2 || inputBShape.size() > 2) {
llvm::SmallVector<int64_t> inputABatchDims, inputBBatchDims;
if (inputAShape.size() > 2) {
inputABatchDims.insert(inputABatchDims.begin(), inputAShape.begin(),
inputAShape.end() - 2);
}
if (inputBShape.size() > 2) {
inputBBatchDims.insert(inputBBatchDims.begin(), inputBShape.begin(),
inputBShape.end() - 2);
}
// Verify that the batch dimensions of input A and B are broadcast
// compatible
llvm::SmallVector<int64_t, 4> broadcastedShape;
if (!OpTrait::util::getBroadcastedShape(inputABatchDims, inputBBatchDims,
broadcastedShape)) {
return emitOpError("Batch dimensions of input A(" +
ttmlir::utils::join(inputABatchDims, ",") +
") and B(" +
ttmlir::utils::join(inputBBatchDims, ",") +
") are not broadcast compatible");
}
// Insert the broadcasted batch dimensions in the expected output shape
expectedOutputShape.insert(expectedOutputShape.begin(),
broadcastedShape.begin(),
broadcastedShape.end());
}
// Insert the input A and B inner dimensions in expected output shape
// Consider the case where input A and B are vectors. In that case,
// the dimension 1 is ommited from the output shape.
if (inputAType.getRank() > 1) {
expectedOutputShape.push_back(inputAShape[inputAShape.size() - 2]);
}
if (inputBType.getRank() > 1) {
expectedOutputShape.push_back(inputBShape[inputBShape.size() - 1]);
}
// Check the case of a vector-vector product. At this moment we don't support
// scalars in IR, hence check that the output is at least 1D tensor of size 1.
if (expectedOutputShape.size() == 0) {
if (outputType.getRank() < 1) {
return emitOpError("Scalar output is not supported, output must be at "
"least a 1D tensor");
}
if (outputType.getRank() > 1 || outputType.getShape()[0] != 1) {
return emitOpError("Scalar output must be a 1D tensor of size 1");
}
return llvm::success();
}
// Verify that the output shape is correct
if (outputShape.size() != expectedOutputShape.size()) {
return emitOpError("Output shape rank(" +
std::to_string(outputShape.size()) +
") must match the expected output shape rank(" +
std::to_string(expectedOutputShape.size()) + ")");
}
// Verify each dim of the output shape
for (size_t i = 0; i < outputShape.size(); i++) {
if (outputShape[i] != expectedOutputShape[i]) {
return emitOpError(
"Output shape dimension[" + std::to_string(i) + "](" +
std::to_string(outputShape[i]) +
") doesn't match the expected output shape dimension[" +
std::to_string(i) + "](" + std::to_string(expectedOutputShape[i]) +
")");
}
}
return success();
}
2. Define the Op in the TTNN backend dialect
Next we will define the Op in the TTNN dialect. TTNN Ops are defined in the same way, but in their respective set of dialect files. Refer to the previous section for details, the process is the same.
TTNNOps.td
def TTNN_MatmulOp : TTNN_NamedDPSOp<"matmul"> {
let arguments = (ins AnyRankedTensor:$a,
AnyRankedTensor:$b,
AnyRankedTensor:$output);
let results = (outs AnyRankedTensor:$result);
let extraClassDeclaration = [{
MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
}];
let hasVerifier = 1;
}
TTNNOps.cpp
// MatmulOp verification
::mlir::LogicalResult mlir::tt::ttnn::MatmulOp::verify() {
::mlir::RankedTensorType inputAType = getA().getType();
::mlir::RankedTensorType inputBType = getB().getType();
::mlir::RankedTensorType outputType = getOutput().getType();
llvm::ArrayRef<int64_t> outputShape = outputType.getShape();
llvm::SmallVector<int64_t> inputAShape(inputAType.getShape());
llvm::SmallVector<int64_t> inputBShape(inputBType.getShape());
// Verify that the input A is at least 1D tensor
if (inputAType.getRank() < 1) {
return emitOpError("Input A must be at least a 1D tensor");
}
// Verify that the input B is at least 1D tensor
if (inputBType.getRank() < 1) {
return emitOpError("Input B must be at least a 1D tensor");
}
// If input A is a vector (1D tensor), 1 is prepended to its dimension for the
// purpose of the matrix multiply. After the matrix multiply, the prepended
// dimension is removed.
if (inputAType.getRank() == 1) {
inputAShape.insert(inputAShape.begin(), 1);
}
// If input B is a vector (1D tensor), a 1 is appended to its dimension for
// the purpose of the matrix-vector product and removed after.
if (inputBType.getRank() == 1) {
inputBShape.push_back(1);
}
// Verify that the input A and input B has matching inner dimensions
if (inputAShape[inputAShape.size() - 1] !=
inputBShape[inputBShape.size() - 2]) {
return emitOpError(
"Input A[-1](" + std::to_string(inputAShape[inputAShape.size() - 1]) +
") and B[-2](" + std::to_string(inputBShape[inputBShape.size() - 2]) +
") must have matching inner dimensions");
}
llvm::SmallVector<int64_t> expectedOutputShape;
// Verify that the batch dimensions are broadcast compatible and construct the
// expected output shape
if (inputAShape.size() > 2 || inputBShape.size() > 2) {
llvm::SmallVector<int64_t> inputABatchDims, inputBBatchDims;
if (inputAShape.size() > 2) {
inputABatchDims.insert(inputABatchDims.begin(), inputAShape.begin(),
inputAShape.end() - 2);
}
if (inputBShape.size() > 2) {
inputBBatchDims.insert(inputBBatchDims.begin(), inputBShape.begin(),
inputBShape.end() - 2);
}
// Verify that the batch dimensions of input A and B are broadcast
// compatible
llvm::SmallVector<int64_t, 4> broadcastedShape;
if (!OpTrait::util::getBroadcastedShape(inputABatchDims, inputBBatchDims,
broadcastedShape)) {
return emitOpError("Batch dimensions of input A(" +
ttmlir::utils::join(inputABatchDims, ",") +
") and B(" +
ttmlir::utils::join(inputBBatchDims, ",") +
") are not broadcast compatible");
}
// Insert the broadcasted batch dimensions in the expected output shape
expectedOutputShape.insert(expectedOutputShape.begin(),
broadcastedShape.begin(),
broadcastedShape.end());
}
// Insert the input A and B inner dimensions in expected output shape
// Consider the case where input A and B are vectors. In that case,
// the dimension 1 is ommited from the output shape.
if (inputAType.getRank() > 1) {
expectedOutputShape.push_back(inputAShape[inputAShape.size() - 2]);
}
if (inputBType.getRank() > 1) {
expectedOutputShape.push_back(inputBShape[inputBShape.size() - 1]);
}
// Check the case of a vector-vector product. At this moment we don't support
// scalars in IR, hence check that the output is at least 1D tensor of size 1.
if (expectedOutputShape.size() == 0) {
if (outputType.getRank() < 1) {
return emitOpError("Scalar output is not supported, output must be at "
"least a 1D tensor");
}
if (outputType.getRank() > 1 || outputType.getShape()[0] != 1) {
return emitOpError("Scalar output must be a 1D tensor of size 1");
}
return llvm::success();
}
// Verify that the output shape is correct
if (outputShape.size() != expectedOutputShape.size()) {
return emitOpError("Output shape rank(" +
std::to_string(outputShape.size()) +
") must match the expected output shape rank(" +
std::to_string(expectedOutputShape.size()) + ")");
}
// Verify each dim of the output shape
for (size_t i = 0; i < outputShape.size(); i++) {
if (outputShape[i] != expectedOutputShape[i]) {
return emitOpError(
"Output shape dimension[" + std::to_string(i) + "](" +
std::to_string(outputShape[i]) +
") doesn't match the expected output shape dimension[" +
std::to_string(i) + "](" + std::to_string(expectedOutputShape[i]) +
")");
}
}
return success();
}
3. Convert / Implement the Op in the TTNN passes
Next we will implement the conversion from the TTIR matmul
Op to the TTNN matmul
Op.
This is a trivial conversion, as the Ops are identical in their semantics, so
the changeset isn't going to be very instructive, but will at least point to the
files involved. The conversion is implemented in the ConvertTTIRToTTNNPass
pass in
file lib/Conversion/TTIRToTTNN/TTIRToTTNNPass.cpp
.
Zooming into class ConvertTTIRToTTNNPass
we can see we implement the pass interface
via member function void runOnOperation() final
. This function will be called
for every operation matching the type specified in the pass tablegen file. A
quick look at include/ttmlir/Conversion/Passes.td
we can see:
def ConvertTTIRToTTNN: Pass<"convert-ttir-to-ttnn", "::mlir::ModuleOp"> {
This means that runOnOperation
will be called for every ModuleOp
in the
graph, usually there is only one ModuleOp
which serves as the root of the
graph.
Inside runOnOperation
is usually where we define a rewrite pattern set that
can match much more complicated patterns (nested inside of the ModuleOp
's
regions)
than just a single operation. In runOperation
method you will see the call to
method populateTTIRToTTNNPatterns(...)
that actually generates rewrite patterns.
Method populateTTIRToTTNNPatterns(...)
is defined
in lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp
.
patterns
.add<TensorEmptyConversionPattern,
ToLayoutOpConversionPattern,
ElementwiseOpConversionPattern<ttir::AbsOp, ttnn::AbsOp>,
ElementwiseOpConversionPattern<ttir::AddOp, ttnn::AddOp>,
ElementwiseOpConversionPattern<ttir::CbrtOp, ttnn::CbrtOp>,
ElementwiseOpConversionPattern<ttir::FloorOp, ttnn::FloorOp>,
ElementwiseOpConversionPattern<ttir::IsFiniteOp, ttnn::IsFiniteOp>,
ElementwiseOpConversionPattern<ttir::LogicalAndOp, ttnn::LogicalAndOp>,
ElementwiseOpConversionPattern<ttir::LogicalOrOp, ttnn::LogicalOrOp>,
ElementwiseOpConversionPattern<ttir::LogicalNotOp, ttnn::LogicalNotOp>,
ElementwiseOpConversionPattern<ttir::LogicalXorOp, ttnn::LogicalXorOp>,
ElementwiseOpConversionPattern<ttir::MultiplyOp, ttnn::MultiplyOp>,
ElementwiseOpConversionPattern<ttir::EqualOp, ttnn::EqualOp>,
ElementwiseOpConversionPattern<ttir::NotEqualOp, ttnn::NotEqualOp>,
ElementwiseOpConversionPattern<ttir::GreaterEqualOp, ttnn::GreaterEqualOp>,
ElementwiseOpConversionPattern<ttir::GreaterThanOp, ttnn::GreaterThanOp>,
ElementwiseOpConversionPattern<ttir::LessEqualOp, ttnn::LessEqualOp>,
ElementwiseOpConversionPattern<ttir::LessThanOp, ttnn::LessThanOp>,
ElementwiseOpConversionPattern<ttir::MaximumOp, ttnn::MaximumOp>,
ElementwiseOpConversionPattern<ttir::MinimumOp, ttnn::MinimumOp>,
ElementwiseOpConversionPattern<ttir::NegOp, ttnn::NegOp>,
ElementwiseOpConversionPattern<ttir::ReluOp, ttnn::ReluOp>,
ElementwiseOpConversionPattern<ttir::GeluOp, ttnn::GeluOp>,
ElementwiseOpConversionPattern<ttir::SqrtOp, ttnn::SqrtOp>,
ElementwiseOpConversionPattern<ttir::RsqrtOp, ttnn::RsqrtOp>,
ElementwiseOpConversionPattern<ttir::SignOp, ttnn::SignOp>,
ElementwiseOpConversionPattern<ttir::SigmoidOp, ttnn::SigmoidOp>,
ElementwiseOpConversionPattern<ttir::Log1pOp, ttnn::Log1pOp>,
ElementwiseOpConversionPattern<ttir::ReciprocalOp, ttnn::ReciprocalOp>,
ElementwiseOpConversionPattern<ttir::ExpOp, ttnn::ExpOp>,
ElementwiseOpConversionPattern<ttir::LogOp, ttnn::LogOp>,
ElementwiseOpConversionPattern<ttir::DivOp, ttnn::DivOp>,
ElementwiseOpConversionPattern<ttir::CeilOp, ttnn::CeilOp>,
ElementwiseOpConversionPattern<ttir::SinOp, ttnn::SinOp>,
ElementwiseOpConversionPattern<ttir::CosOp, ttnn::CosOp>,
ElementwiseOpConversionPattern<ttir::Expm1Op, ttnn::Expm1Op>,
ElementwiseOpConversionPattern<ttir::RemainderOp, ttnn::RemainderOp>,
ElementwiseOpConversionPattern<ttir::WhereOp, ttnn::WhereOp>,
ElementwiseUnaryWithFloatParameterOpConversionPattern<ttir::LeakyReluOp, ttnn::LeakyReluOp>,
ReductionOpConversionPattern<ttir::SumOp, ttnn::SumOp>,
ReductionOpConversionPattern<ttir::MeanOp, ttnn::MeanOp>,
ReductionOpConversionPattern<ttir::MaxOp, ttnn::MaxOp>,
BroadcastOpConversionPattern,
EmbeddingOpConversionPattern,
SoftmaxOpConversionPattern,
TransposeOpConversionPattern,
TypecastOpConversionPattern,
ClampOpConversionPattern,
ConcatOpConversionPattern,
ReshapeOpConversionPattern,
SliceOpConversionPattern,
SqueezeOpConversionPattern,
UnsqueezeOpConversionPattern,
ConstantOpConversionPattern,
MatmulOpConversionPattern,
Conv2dOpConversionPattern,
MaxPool2dOpConversionPattern,
SubtractOpConversionPattern,
AllGatherOpConversionPattern
>(typeConverter, ctx);
More information on rewrite patterns and their capabilities can be found in the MLIR documentation here and here.
For matmul, we defined a new conversion pattern that's generic to all binary ops
with arguments named a
and b
:
class MatmulOpConversionPattern : public OpConversionPattern<ttir::MatmulOp> {
public:
using OpConversionPattern<ttir::MatmulOp>::OpConversionPattern;
LogicalResult
matchAndRewrite(ttir::MatmulOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {
rewriter.replaceOpWithNewOp<ttnn::MatmulOp>(
op, this->getTypeConverter()->convertType(op.getType()), adaptor.getA(),
adaptor.getB(), adaptor.getOutput());
return success();
}
};
Invoked as part of the rewrite set:
MatmulOpConversionPattern
Note:
We also need to add this op to the C++ emitter,
lib/Conversion/TTNNToEmitC/TTNNToEmitC.cpp
see
populateTTNNToEmitCPatterns(...)
.
4. Add a compiler unit test for the Op
So far we have defined the Op in the TTIR and TTNN dialects,
implemented verifiers, and have conversion passes. Now we need to add a unit
test to ensure that the pass is working correctly. The compiler unit tests are located
in test/ttmlir/Dialect
area. In this case we'll add a test under the TTNN
subdirectory since we are testing the ConvertTTIRToTTNNPass
.
test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir
// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline %s | FileCheck %s
#any_device_tile = #tt.operand_constraint<dram|l1|tile|any_device_tile>
// CHECK: #[[TILED_LAYOUT:.*]] = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, interleaved>
module attributes {} {
func.func @forward(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
%0 = tensor.empty() : tensor<64x96xbf16>
// CHECK: %[[C:.*]] = "ttnn.matmul"[[C:.*]]
%1 = "ttir.matmul"(%arg0, %arg1, %0) <{operand_constraints = [#any_device_tile, #any_device_tile, #any_device_tile]}> : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
return %1 : tensor<64x96xbf16>
}
}
Unit tests in MLIR are typically written using a tool called
FileCheck
, please refer to the llvm FileCheck documentation for a tutorial and more information about theRUN
andCHECK
directives.
A few things to point out specifically regarding tt-mlir dialects:
tt.system_desc
: This is a 1-1 mapping to theSystemDesc
flatbuffer schema that is used to describe the system configuration. This is a required attribute tagged on the top level module for all tt-mlir dialects.- Pass
--ttir-layout
is a prerequisite before runningconvert-ttir-to-ttnn
. This pass is responsible for converting the input tensors to device memory space and tile layout before lowering to TTNN. - This test is asserting that
ttir.matmul
converts tottnn.matmul
.
To run the test, you can use the following command:
cmake --build build -- check-ttmlir
You can also manually run ttmlir-opt
on the test file to see the
resulting output:
./build/bin/ttmlir-opt --ttir-load-system-desc="path=<PATH_TO_SYSTEM_DESC>" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir
5. Define flatbuffer schema for the Op
Next we will define the flatbuffer schema for the Op. The schema must capture all tensor inputs, outputs, and attributes of the Op, i.e. everything the runtime needs to execute the Op.
include/ttmlir/Target/TTNN/program.fbs
table MatmulOp {
in0: tt.target.TensorRef;
in1: tt.target.TensorRef;
out: tt.target.TensorRef;
}
Type TensorRef
, flatbuffer tables with suffix Ref
are used to represent live values
during the runtime, decoupled from the underlying Desc
suffixes which carry the
type and attribute information for the object.
We also add this new op to the union OpType
, which is the variant type for all
ops.
More information about writing flatbuffer schemas can be found in the flatbuffers documentation
6. Serialize the Op in the flatbuffer format
In the previous section we defined the flatbuffer schema for the matmul
Op, now let's put our new schema definition to use. The schema is used as input
to a program called flatc
which generates C++ code (or any language for that
matter) for serializing and deserializing the schema. This generated code can be
found in build/include/ttmlir/Target/TTNN/program_generated.h
.
Let's head over to lib/Target/TTNN/TTNNToFlatbuffer.cpp
to define
a createOp
overloaded function that does the conversion from MLIR to flatbuffer:
::flatbuffers::Offset<::tt::target::ttnn::MatmulOp>
createOp(FlatbufferObjectCache &cache, MatmulOp op) {
auto in0 =
cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getA()));
auto in1 =
cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getB()));
auto output = cache.at<::tt::target::TensorRef>(
getOperandThroughDPSOps(op.getResult()));
return ::tt::target::ttnn::CreateMatmulOp(*cache.fbb, in0, in1, output);
}
Lots of things are happening here, let's break it down:
FlatbufferObjectCache
: This is a helper class that is used to cache objects in the flatbuffer that are created during the serialization process. This is necessary for managing value lifetimes and identifiers, at the same time it is an optimization to avoid having multiple copies of the same object. For example, aTensorRef
with multiple uses could naively be recreated, one for each use, but with the cache we can ensure that the object is only created once and all uses point to the same flatbuffer offset. The cache is passed around to all serialization functions and should be used whenever creating a new object.getOperandThroughDPSOps
: In section 1. we discussed DPS semantics and the drawback of having the result alias the output tensor. This is one of those cases where we need to use a helper function to trace through the output operands to find the original SSA name in order to associate it with the originalTensorRef
.CreateMatmulOp
: The autogenerated function from the flatbuffer schema that actually serializes the data into the flatbuffer format.
We can finally generate a binary with our new Op! We can use the following command:
./build/bin/ttmlir-opt --ttir-load-system-desc="path=<PATH_TO_SYSTEM_DESC>" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir | ./build/bin/ttmlir-translate --ttnn-to-flatbuffer -o out.ttnn
And we can inspect the with ttrt
:
ttrt read out.ttnn
7. Add runtime support for the Op
Next, we want to add runtime support for the Op by parsing the flatbuffer and invoking the TTNN API.
runtime/lib/ttnn/operations/matmul/matmul.cpp
namespace tt::runtime::ttnn::operations::matmul {
void run(const ::tt::target::ttnn::MatmulOp *op, ProgramContext &context) {
ProgramTensorPool &tensorPool = context.getTensorPool();
const ::ttnn::Tensor &lhs = tensorPool.at(op->in0()->global_id());
const ::ttnn::Tensor &rhs = tensorPool.at(op->in1()->global_id());
DEBUG_ASSERT(lhs.is_allocated());
DEBUG_ASSERT(rhs.is_allocated());
::ttnn::DataType outputDataType = utils::getDataType(op->out());
::tt::tt_metal::MemoryConfig outputMemoryConfig =
utils::createMemoryConfig(op->out());
std::optional<
::ttnn::operations::matmul::MatmulMultiCoreReuseMultiCast1DProgramConfig>
programConfig = std::nullopt;
const std::optional<const ::tt::tt_metal::MemoryConfig> memoryConfig =
std::make_optional(outputMemoryConfig);
const std::optional<const ::ttnn::DataType> dtype =
std::make_optional(outputDataType);
::ttnn::Tensor out = ::ttnn::matmul(
lhs, rhs, /*transposeA*/ false, /*transposeB*/ false, memoryConfig, dtype,
/*programConfig*/ std::nullopt, /*activation*/ std::nullopt,
/*computeKernelConfig*/ std::nullopt, /*coreGrid*/ std::nullopt);
tensorPool.insert_or_assign(op->out()->global_id(), out);
}
} // namespace tt::runtime::ttnn::operations::matmul
A couple things to note from above:
- Most runtime op functions will follow a similar pattern, they will take in
some additional datastructures for managing the program context.
- Program context tracks the state of the current program. It stores intermediate tensors and devices.
tensorPool.at(op->in0()->global_id())
:global_id
is a unique identifier for the tensor that was generated and managed by theFlatbufferObjectCache
. This is how it's intended to be used by the runtime.- Some operations may belong to a larger set of operations. For example, any eltwise unary operations can
be added in
runtime/lib/ttnn/operations/eltwise/unary.cpp
directly without needing to create a new file.
If a new file is created for the op, we need to add a new source to runtime/lib/ttnn/operations/CMakeLists.txt
and a new case to runtime/lib/ttnn/program.cpp
.
To update runtime/lib/ttnn/operations/CMakeLists.txt
, include the path to the source file in TTNN_OPS_SRCS
:
runtime/lib/ttnn/operations/CMakeLists.txt
${CMAKE_CURRENT_SOURCE_DIR}/matmul/matmul.cpp
To update runtime/lib/ttnn/program.cpp
, add a new case to the runOperation
method of ProgramExecutor
:
runtime/lib/ttnn/program.cpp
case ::tt::target::ttnn::OpType::MatmulOp: {
return operations::matmul::run(op->type_as_MatmulOp(), context);
}
We can test our changes with ttrt
(don't forget to rebuild ttrt
):
ttrt run out.ttnn
8. Add a silicon unit test for the Op
After adding runtime support, we're ready to test our Op on silicon. All silicon tests are located
under test/ttmlir/Silicon
. The process is similar to adding a compiler unit test.
In our specific case, we create a unit test here: test/ttmlir/Silicon/TTNN/simple_matmul.mlir
:
test/ttmlir/Silicon/TTNN/simple_matmul.mlir
// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%" %s > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn
#any_device_tile = #tt.operand_constraint<dram|l1|tile|any_device_tile>
// CHECK: #[[TILED_LAYOUT:.*]] = #ttnn.ttnn_layout<(d0, d1) -> (d0, d1), <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, interleaved>
module attributes {} {
func.func @forward(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
%0 = tensor.empty() : tensor<64x96xbf16>
// CHECK: %[[C:.*]] = "ttnn.matmul"[[C:.*]]
%1 = "ttir.matmul"(%arg0, %arg1, %0) <{operand_constraints = [#any_device_tile, #any_device_tile, #any_device_tile]}> : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
return %1 : tensor<64x96xbf16>
}
}
Couple things to point out about this process:
- Tests placed under
test/ttmlir/Dialect
will only test the compiler's capability of compiling the module. If you want the module to run on silicon in CI, the test must be placed undertest/ttmlir/Silicon
. - Notice the differences between the compilation headers of
test/ttmlir/Silicon/TTNN/simple_matmul.mlir
andtest/ttmlir/Dialect/TTNN/matmul/simple_matmul.mlir
--ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%"
: Thesystem-desc-path
option specifies the location of the system descriptor required for compiling the module. This is crucial for silicon tests, as modules compiled with different system descriptors may vary in silicon compatibility. Ensuring the system descriptor accurately reflects the target hardware is essential for running the module correctly.// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn
: This runsttmlir-translate
that serializes the output mlir module to a flatbuffer binary. We added the logic for this serialization in the Serialize the Op in the flatbuffer format section.
Decomposing an Op in TTIR
This guide explains how to add and decompose a new operation in the TTIR dialect. We’ll focus on adding an Index
operation, which will be decomposed into the Slice
operation. The decomposition is implemented as a conversion pass in MLIR since it allows us to mark operations or dialects as legal or illegal, type conversion...
This guide will cover the following steps:
1. Define the Op in the TTIR frontend dialect
The more information regarding this step can be found here: Define the Op in the TTIR frontend dialect
I updated the TTIROps.td
as following:
def TTIR_IndexOp: TTIR_DPSOp<"index"> {
let summary = "Index op.";
let description = [{
Extract a sub-tensor (slice) from the input tensor along a specified dimension.
The `begin`, `end`, and `step` attributes define the start, stop, and step indices for the
selected dimension (`dim`) of the tensor.
}];
let arguments = (ins AnyRankedTensor:$input,
AnyRankedTensor:$output,
I32Attr:$dim,
I32Attr:$begin,
I32Attr:$end,
I32Attr:$step,
TT_OperandConstraintArrayAttr:$operand_constraints);
let results = (outs AnyRankedTensor:$result);
let extraClassDeclaration = [{
MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
}];
let hasVerifier = 1;
}
The verification function has been added as well:
// IndexOp verification
::mlir::LogicalResult mlir::tt::ttir::IndexOp::verify() {
::mlir::RankedTensorType inputType = getInput().getType();
::llvm::ArrayRef<int64_t> inputShape = inputType.getShape();
::mlir::RankedTensorType outputType = getOutput().getType();
int32_t dim = getDim();
int32_t begin = getBegin();
int32_t end = getEnd();
int32_t step = getStep();
// Verify that the input is at least 1D tensor
if (inputType.getRank() < 1) {
return emitOpError("Input must be at least a 1D tensor");
}
// Validate that the output tensor has the same element type as the input
// tensor
if (inputType.getElementType() != outputType.getElementType()) {
return emitOpError(
"Output tensor must have the same element type as the input tensor");
}
// Verify the output tensor rank
if (inputType.getRank() != outputType.getRank()) {
return emitOpError(
"Output tensor must have the same rank as the input tensor");
}
// Verify that the dim attribute is within the bounds of the input tensor
if (dim < 0 || dim >= inputType.getRank()) {
return emitOpError() << "Invalid dimension index " << dim
<< ". Input tensor rank is " << inputType.getRank();
}
// Verify begin, end, step and the output tensor dimensions
int64_t dimSize = inputShape[dim];
// Adjust negative begin and end
int32_t adjustedBegin = (begin < 0) ? (begin + dimSize) : begin;
int32_t adjustedEnd = (end < 0) ? (end + dimSize) : end;
std::ostringstream inputShapeStream;
inputShapeStream << "(";
for (size_t i = 0; i < inputShape.size(); ++i) {
inputShapeStream << inputShape[i];
if (i != inputShape.size() - 1) {
inputShapeStream << ", ";
}
}
inputShapeStream << ")";
std::string inputShapeStr = inputShapeStream.str();
if (adjustedBegin < 0 || adjustedBegin >= dimSize) {
return emitOpError() << "Invalid begin index for dimension "
<< std::to_string(dim) << ". Expected value in range ["
<< std::to_string(-dimSize) << ", " << dimSize
<< "), got " << begin
<< ". Input shape: " << inputShapeStr;
}
if (adjustedEnd < 0 || adjustedEnd > dimSize) {
return emitOpError() << "Invalid end index for dimension "
<< std::to_string(dim) << ". Expected value in range ["
<< std::to_string(-dimSize) << ", " << dimSize
<< "], got " << end
<< ". Input shape: " << inputShapeStr;
}
auto formatValueMessage = [](int value, int adjustedValue) {
return value < 0 ? std::to_string(adjustedValue) + " (" +
std::to_string(value) + ")"
: std::to_string(value);
};
std::string beginValueMessage = formatValueMessage(begin, adjustedBegin);
std::string endValueMessage = formatValueMessage(end, adjustedEnd);
if (step == 0) {
return emitOpError("Step value for dimension " + std::to_string(dim) +
" cannot be zero");
}
if (step > 0 && adjustedBegin > adjustedEnd) {
return emitOpError() << "For positive step, begin index must be less "
"than or equal to end index for dimension "
<< dim << ". Got begin: " << beginValueMessage
<< ", end: " << endValueMessage << ", step: " << step
<< ", input shape: " << inputShapeStr;
}
if (step < 0 && adjustedBegin < adjustedEnd) {
return emitOpError() << "For negative step, begin index must be greater "
"than or equal to end index for dimension "
<< dim << ". Got begin: " << beginValueMessage
<< ", end: " << endValueMessage << ", step: " << step
<< ", input shape: " << inputShapeStr;
}
// Calculate the expected size of the output dimension
int32_t expectedDimSize =
(std::abs(adjustedEnd - adjustedBegin) + std::abs(step) - 1) /
std::abs(step);
if (outputType.getDimSize(dim) != expectedDimSize) {
return emitOpError() << "Mismatch in dimension " << std::to_string(dim)
<< " of the output tensor: expected size "
<< expectedDimSize << ", but got "
<< outputType.getDimSize(dim);
}
return success();
}
2. Create a conversion pattern
A conversion pattern defines how MLIR should rewrite the Op. It can be implemented in either C++ or TableGen. Currently, we only have the C++ implementation; TableGen format will be added in the future.
C++ conversion pattern
For the Index
operation, we use the C++ conversion pattern because it involves changing the Op’s input types from integers to arrays, which TableGen lacks flexibility for.
// This transformation adjusts IndexOp attributes so that `begin`, `end`, and
// `step` become arrays, where each array element corresponds to a dimension of
// the input tensor. For dimensions other than the sliced dimension, default
// values are used.
//
struct IndexToSliceConversionPattern
: public OpConversionPattern<ttir::IndexOp> {
using OpConversionPattern<ttir::IndexOp>::OpConversionPattern;
LogicalResult
matchAndRewrite(ttir::IndexOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {
auto inputType =
::mlir::dyn_cast<mlir::RankedTensorType>(adaptor.getInput().getType());
if (!inputType || !inputType.hasRank()) {
return failure();
}
int64_t rank = inputType.getRank();
llvm::SmallVector<mlir::Attribute, 4> begins, ends, steps;
for (int64_t i = 0; i < rank; ++i) {
if (i == op.getDim()) {
begins.push_back(rewriter.getI32IntegerAttr(adaptor.getBegin()));
ends.push_back(rewriter.getI32IntegerAttr(adaptor.getEnd()));
steps.push_back(rewriter.getI32IntegerAttr(adaptor.getStep()));
} else {
begins.push_back(rewriter.getI32IntegerAttr(0));
ends.push_back(rewriter.getI32IntegerAttr(inputType.getDimSize(i)));
steps.push_back(rewriter.getI32IntegerAttr(1));
}
}
auto newOp = rewriter.create<ttir::SliceOp>(
op.getLoc(), op.getType(), adaptor.getInput(), adaptor.getOutput(),
rewriter.getArrayAttr(begins), rewriter.getArrayAttr(ends),
rewriter.getArrayAttr(steps), adaptor.getOperandConstraints());
rewriter.replaceOp(op, newOp.getResult());
return success();
}
};
The matchAndRewrite
method from OpConversionPattern
is implemented to replace the matched Op with the newly created Op. Since decomposition is implemented as a conversion pass, OpAdaptor
is used to access the attributes of the original Op in their converted types. Finally, we instantiate the new Op and call the replaceOp
method on ConversionPatternRewriter
to replace the original Op.
Tablegen conversion pattern
TODO
3. Register the created conversion pattern
To register the new pattern, go to the populateTTIRToTTIRDecompositionPatterns
function in TTIRToTTIRDecomposition.cpp
and add it to RewritePatternSet
using the add method. After that is done you should mark the decomposed op as illegal in runOnOperation
method of TTIRToTTIRDecompositionPass
in TTIRToTTIRDecompositionPass.cpp
.
You should also add a silicon test like described here: Add a silicon unit test for the Op. This is how the silicon test for the Index
operation looks like:
// RUN: ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=%system_desc_path%" %s > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn
#any_device_tile = #tt.operand_constraint<dram|l1|interleaved>
module attributes {} {
func.func @forward(%arg0: tensor<4x32x32xbf16>) -> tensor<4x32x16xbf16> {
%0 = tensor.empty() : tensor<4x32x16xbf16>
// CHECK: %[[C:.*]] = "ttnn.slice"[[C:.*]]
%1 = "ttir.index"(%arg0, %0) <{dim = 2: i32, begin = 0: i32, end = 32: i32, step = 2: i32, operand_constraints = [#any_device_tile, #any_device_tile]}> : (tensor<4x32x32xbf16>, tensor<4x32x16xbf16>) -> tensor<4x32x16xbf16>
return %1 : tensor<4x32x16xbf16>
}
}
Doxygen
This is a link to a doxygen autogenerated code reference. Doxygen
Build Instructions
To build Doxygen use the doxygen target in CMake
cmake -B build
cmake --build build -- doxygen
Specifications
Specifications are documents that define the requirements for features or concepts that are particularly cross-cutting, complex, or require a high degree of coordination and planning. They are intended to be a living document that evolves as the feature is developed and should be maintained as the goto reference documentation for the feature or concept.
Specifications are written in markdown and are stored in the docs/src/specs
directory of the repository. Below is a template that should be used when
creating a new specification.
Specification Template
# [Title]
A brief description of the feature or concept that this specification is
defining.
## Motivation
A description of why this feature or concept is needed and what problem it is
solving. This section is best written by providing concrete examples and use
cases.
## Proposed Changes
A list of the components that will be impacted by this spec and a detailed
description of the changes that will be made to each respective component.
It should also call out any interactions between components and how they might
share an interface or communicate with each other.
## Test Plan
A brief description of how the feature or concept will be tested.
## Concerns
A list of concerns that have been identified during the design of this feature.
Runtime Stitching
Runtime stitching adds the ability for the runtime to stitch together multiple, indepently compiled programs together at runtime, ie. without compiler knowledge of how the binary programs will be composed.
Motivation
In order to flexibly support arbitrary training schedules / composing multiple models together we want to have the ability for the runtime to stitch graphs together. To achieve this we need to define an ABI kind of interface between the compiler and the runtime.
Simple Example
mod_a = forge.compile(PyTorch_module_a)
mod_b = forge.compile(PyTorch_module_b)
for i in range(10):
outs_a = mod_a(ins_a)
outs_b = mod_b(outs_a)
mod_a
and mod_b
are 2 independent compile steps, during the compile step for
mod_a
it should be completely unaware that mod_b
will take place and vice-versa.
In order to achieve this we propose a new runtime concept called stitching:
- forge invokes compile step for
mod_a
, tt-mlir compiler determines where the inputs (ins_a
) should live, host, device dram, device l1. tt-mlir returns metadata to forge describing where it wants the tensors to reside before invoking flatbuffer submission. - forge invokes compile step for
mod_b
, same happens as bullet 1 mod_a
is invoked at runtime, forge runtime needs to inspect the compiler metadata to determine where the tensors should live. Runtime manually invokes a new data copy command to get the tenors to the correct memory space / correct memory address.- forge runtime invokes
mod_a
program submit mod_b
is invoked at runtime, this time it might be that the compiler left the tensor outputs in L1, so no data copy is needed to start runningmod_b
since the inputs are already in the correct location.
A more concrete usecase would be a training loop where there are often multiple graphs composed together. #82 Or when we eventually support torch 2.0, the torch runtime can arbitrarily break the graph anywhere.
Proposed Changes
Compiler Metadata
Compiler will encode the input tensor layout information directly into the flatbuffer tensor desc. The flatbuffer schema already exists to express this, we just need to adopt populating it instead of assuming a canonical host layout.
Compiler will decide where the tensors should live, host, device dram, device l1.
Runtime
- Runtime will inspect the tensor desc metadata to determine where the tensors need to end up / what layout they should be in before invoking the program.
- New runtime API
Tensor toLayout(Tensor tensor, ::tt::target::TensorDesc* tensorDesc);
- Runtime will need to invoke
toLayout
on all input tensors before invoking the program.
Test Plan
- Add a new test to the runtime gtest suite that verifies the runtime can correctly stitch together 2 independently compiled programs.
Concerns
- Tensors pass through device memory spaces (dram, L1) will have a dynamic address, some arbitrary run order of flatbuffer could cause tensors to end up in non-ideal locations in memory. Specifically, L1, a poorly placed tensor might not be able to be moved to a better location without a bounce through DRAM.
Tensor Layout
The tensor layout attribute captures how tensor data is sharded across a grid of devices, cores, and is laid out in memory.
Motivation / High level goals
- Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
- Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
- Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
- Explicit: A single source of truth.
- Enable a direct way to query padded regions.
An Example / Walkthrough
Let's consider a snippet of MLIR:
tensor<2x3x64x128xf32>
Here we've defined a 4 dimensional tensor using MLIR's builtin tensor type. This tensor type has an optional attribute called an Encoding; this attribute has been used by the TT dialect to encode the tensor's layout. This looks like:
tensor<2x3x64x128xf32,
#tt.layout<
(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
undef,
<1x1>,
memref<384x128xf32, #tt.memory_space<l1>>
>
>
At the time of this writing there are 4 properties that make up a tensor layout:
- `linear`: An affine map that defines how the logical tensor dimensions map to a grid shape. Note that the number of dims in the affine map must match exactly the rank of the original tensor, and the number of results must match exactly the rank of the grid shape.
- `oob_val`: A tracked out of bounds value that fills padding space.
- `grid`: The grid shape that this tensor is divided onto.
- `memref`: A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid.
This example isn't particularly complicated because it's only sharded to a 1x1 grid; the rest of the document will go into more detail on the following topics.
Before we jump into more advanced topics there are two resources that could be useful to have at hand:
- `test/python/tensor_layout.py`: Python test with many convenience functions for creating and experimenting with tensor layouts.
- TTNN Interactive Visualizer: An interactive visualization tool that demonstrates the transformation. Note that this tool was created for TTNN tensor layout, but many of the same concepts transfer over.
Dimension Collapsing
Probably the most important concept in tt.layout
is dimension collapsing.
This is captured by the affine map linear
property which provides a
mapping from tensor dim space to a reduced physical dimensional space. This
single-handedly touches on most of the tensor layout goals mentioned at the
beginning of the doc:
- Leaves tensor shapes intact
- Logical-Physical mapping, how the tensor is laid out in memory over a grid
- Enables more flexible sharding
- Explicit padding
To see how these goals are achieved we'll continue working on an explicit example, same one as above:
(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3)
To recap, we have our example 4d tensor (2, 3, 64, 128)
, which maps directly to the
LHS (d0, d1, d2, d3)
. We have our 2d grid shape (1, 1)
, notice the
affine-map RHS is also 2d, and this describes how tensor dims map to a lower dimensional
physical memory, overlaid on a grid. We'll see how this gets divided onto the grid later,
but first let's look at how this forms an affine-map iteration space. If we
index our tensor at say [1, 1, 6, 100]
, we can simply plugin those numbers
to get our remapped offset:
(1 * 192 + 1 * 64 + 6, 100) = (262, 100)
This remapped offset (262, 100)
corresponds to the row and column index of the
collapsed physical memory.
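As a sanity check, this remapping is easy to reproduce in plain Python; a small sketch with the example's map hard-coded:

```python
# The linear map (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3)
# evaluated at logical index [1, 1, 6, 100].
def linear(d0, d1, d2, d3):
    return (d0 * 192 + d1 * 64 + d2, d3)

assert linear(1, 1, 6, 100) == (262, 100)
```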
By default, the dim range `[0, -1)` is collapsed, but the `tt.layout` constructor can actually take a programmable range called `collapseIntervals`. `collapseIntervals` is a list of pairs, where each pair is a dim range interval, left inclusive, right exclusive. Let's consider a few examples. Instead of multiplying out real shapes, we will use `<>` to represent a dimension join operator.
- 3D tensor onto a 2D grid and default `collapseIntervals=[(0, -1)]`: `(d0, d1, d2) -> (d0 <> d1, d2)`
- 4D tensor onto a 3D grid and `collapseIntervals=[(1, -1)]`: `(d0, d1, d2, d3) -> (d0, d1 <> d2, d3)`
- 4D tensor onto a 3D grid and `collapseIntervals=[(0, 2)]`: `(d0, d1, d2, d3) -> (d0 <> d1, d2, d3)`
- 7D tensor onto a 4D grid and `collapseIntervals=[(0, 3), (-3, -1)]`: `(d0, d1, d2, d3, d4, d5, d6) -> (d0 <> d1 <> d2, d3, d4 <> d5, d6)`
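To pin down the interval semantics, here is a small Python sketch (not the tt-mlir implementation) that computes which dims get joined for a given rank and `collapseIntervals`:

```python
# Sketch of how collapseIntervals groups tensor dims. Intervals are
# [left, right) pairs; negative indices count from the end, like Python
# slices. Dims not covered by any interval stay as singleton groups.
def collapse_dims(rank, collapse_intervals=((0, -1),)):
    norm = [(l % rank, r % rank if r < 0 else r) for l, r in collapse_intervals]
    groups, d = [], 0
    while d < rank:
        interval = next(((l, r) for l, r in norm if l == d), None)
        if interval:
            groups.append(list(range(interval[0], interval[1])))
            d = interval[1]
        else:
            groups.append([d])
            d += 1
    return groups

assert collapse_dims(3) == [[0, 1], [2]]                             # default
assert collapse_dims(4, [(1, -1)]) == [[0], [1, 2], [3]]
assert collapse_dims(7, [(0, 3), (-3, -1)]) == [[0, 1, 2], [3], [4, 5], [6]]
```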
Multi-core
Let's consider the original example again, but on a larger grid than 1x1
, say 2x4
:
tensor<2x3x64x128xf32,
#tt.layout<
(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
undef,
<2x4>,
memref<192x32xf32, #tt.memory_space<l1>>
>
>
The number of affine map results, grid shape, and memref shape all must have the same rank. We can see in this example that by changing the grid shape we also changed the memref shape; we can always calculate the memref shape by plugging the full tensor dims into our affine map and then dividing by the grid shape.
(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
(2 - 1, 3 - 1, 64 - 1, 128 - 1) = (1 * 192 + 2 * 64 + 63, 127) = (383, 127)
Above we actually subtracted 1 in order to get the index of the last element of the tensor. Now we can simply add back 1 to get the size:
(383 + 1, 127 + 1) = (384, 128)
Finally, we divide the dims by the respective grid dims:
(384 / 2, 128 / 4) = (192, 32)
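The same derivation can be written out programmatically; a sketch with this example's map hard-coded:

```python
# Sketch: derive the shard (memref) shape for the 2x3x64x128 example.
def linear(d0, d1, d2, d3):
    return (d0 * 192 + d1 * 64 + d2, d3)

def shard_shape(tensor_shape, grid_shape):
    last = linear(*[d - 1 for d in tensor_shape])  # index of the last element
    physical = [d + 1 for d in last]               # add 1 back to get sizes
    return [p // g for p, g in zip(physical, grid_shape)]

assert shard_shape([2, 3, 64, 128], [1, 1]) == [384, 128]
assert shard_shape([2, 3, 64, 128], [2, 4]) == [192, 32]
```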
Here are a few more example MLIR snippets:
tensor<8x300xf32,
#tt.layout<(d0, d1) -> (d0, d1),
undef,
<1x2>,
memref<8x150xf32, #tt.memory_space<l1>>
>
>
tensor<8x96x32xf32,
#tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
undef,
<2x1>,
memref<384x32xf32, #tt.memory_space<l1>>
>
>
tensor<8x96x32xf32,
#tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
undef,
<2x1x2>,
memref<384x96x16xf32, #tt.memory_space<l1>>
>
>
tensor<5x3x2x2x7x32x32xf32,
#tt.layout<
(d0, d1, d2, d3, d4, d5, d6)
-> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
undef,
<3x2x2x2>,
memref<4480x4x16x16xf32, #tt.memory_space<l1>>
>
>
A couple of final notes regarding grid shape:
- Grid shapes of rank > 2 are perfectly legal. Not only is this useful for describing multi-device grid topologies, but it is often convenient to have higher ranked grids to better describe how a high rank tensor should be divided. The grid shape here is a virtual grid shape; the `tt.device` attribute will hold an additional affine map that defines how this virtual grid shape maps to a physical one.
- Grid shapes where either columns or rows are > the physical device grid are also legal. Since this is only a virtual grid shape, we could have some grid `1x64` that maps to a physical `8x8` device grid (this particular example is called width sharding in TTNN).
Tilized
A tilized tensor is one with a memref that has a tile element type.
Given some tensor with scalar layout:
tensor<3x64x128xf32,
#tt.layout<
(d0, d1, d2) -> (d0 * 64 + d1, d2),
undef,
<3x2>,
memref<64x64xf32, #tt.memory_space<l1>>
>
>
After tilizing we'll have:
tensor<3x64x128xf32,
#tt.layout<
(d0, d1, d2) -> (d0 * 64 + d1, d2),
undef,
<3x2>,
memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
>
>
Notice the `memref` dim was `ceilDiv`'d by the tile shape and the element type becomes a `tt.tile` type. Also notice that the tensor shape and element type remain intact.
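The tiled shape follows from a ceiling division of the scalar shard shape by the tile shape; a quick sketch:

```python
def ceil_div(a, b):
    return -(-a // b)

# ceilDiv of the scalar shard shape by the tile shape gives the tiled
# memref shape; the logical tensor shape and element type are untouched.
def tilize_shape(shard_shape, tile_shape=(32, 32)):
    return [ceil_div(s, t) for s, t in zip(shard_shape, tile_shape)]

assert tilize_shape((64, 64)) == [2, 2]  # -> memref<2x2x!tt.tile<32x32, ...>>
```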
Padding
Padding can be a bit of an overloaded term, but in this context it refers to an out of bounds area in the physical memory allocation that has no real tensor data in it. The contents of this area are tracked by `oob_val`, and the padding area can be automatically derived from the attributes of `tt.layout`.
Padding is a necessary evil that arises when a tensor is not evenly divisible by a grid shape or tile shape. It can also arise due to minimum Noc addressing requirements.
Example of non-divisible grid:
tensor<53x63xf32,
#tt.layout<
(d0, d1) -> (d0, d1),
undef,
<3x2>,
memref<18x32xf32, #tt.memory_space<l1>>
>
>
The grid dims always `ceilDiv` the affine map results; real tensor data will entirely fill initial shards and the last shard in each dimension will be partially filled.
In this particular example, we have 1 scalar row of padding on the last row of cores and 1 scalar column of padding on the last column of cores.
Taking the above example a step further, we could tilize it:
tensor<53x63xf32,
#tt.layout<
(d0, d1) -> (d0, d1),
undef,
<3x2>,
memref<1x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
>
>
Tile dims also always `ceilDiv` the resulting `memref` shape. Notice now that the padding is slightly more complicated. Our scalar shard shape was `18x32`, but this was further padded to `32x32`, meaning that every core now has 14 rows of padding except for the last row of cores, which has 15 rows of padding.
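The padding counts above can be verified with a few lines of arithmetic; a sketch:

```python
def ceil_div(a, b):
    return -(-a // b)

# 53x63 tensor, 3x2 grid, 32x32 tiles.
shape, grid, tile = (53, 63), (3, 2), (32, 32)
shard = [ceil_div(s, g) for s, g in zip(shape, grid)]       # [18, 32]
padded = [ceil_div(s, t) * t for s, t in zip(shard, tile)]  # [32, 32]
pad_rows = padded[0] - shard[0]                             # 32 - 18 = 14
last_core_rows = shape[0] - (grid[0] - 1) * shard[0]        # 53 - 36 = 17
pad_rows_last = padded[0] - last_core_rows                  # 32 - 17 = 15
assert (pad_rows, pad_rows_last) == (14, 15)
```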
Also note that there is an order of operations here, grid divides the scalar shape first and then we tilize. This is important because it can enable use cases that frequently arise in conv networks that would otherwise result in reshards in between every layer.
With the affine map we can be even more flexible in how we pad; we can bump our stride between dimensions. Consider this tensor (with batch dim `2`):
tensor<2x8x32xf32,
#tt.layout<
(d0, d1, d2) -> (d0 * 8 + d1, d2),
undef,
<1x2>,
memref<16x16xf32, #tt.memory_space<l1>>
>
>
If we tilized the above tensor we'd end up with a `memref` shape of `1x1x!tt.tile<32x32>`, that is, all batches are tightly packed within a single tile. Let's say that for some reason we do not want the batches (`2`) to be tightly packed within a tile; perhaps the mathematical operation we're doing requires the batch to be independently evaluated and thus the (S)FPU needs them in separate tiles. We can adjust this by adjusting the stride of the affine map:
(d0, d1, d2) -> (d0 * 32 + d1, d2),
Instead of striding by the number of logical rows, `8`, we bump the stride up to `32`, effectively pushing a gap between the collapsed rows and enabling each batch to fall on a tile boundary.
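A quick check of the boundary arithmetic (a sketch, with 32 tile rows):

```python
# Under (d0 * stride + d1, d2), batch b starts at physical row b * stride.
tile_rows = 32
assert (1 * 8) % tile_rows != 0    # stride 8: batch 1 starts mid-tile
assert (1 * 32) % tile_rows == 0   # stride 32: batch 1 lands on a tile boundary
```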
Memory Spaces
At the time of writing this document there are 4 memory spaces:
- `System`: Host memory space that is not device visible.
- `SystemMMIO`: Host memory space that is device visible.
- `DeviceDRAM`: DRAM local to the device.
- `DeviceL1`: SRAM on each core.
Something worth noting here is that a tensor must belong exclusively to only one of these memory spaces at a time. For example, in order to stream tensor data from `DeviceDRAM` to `DeviceL1` you would need to either manually slice the tensor into smaller tensors that do fit in L1 or have native support in the op's kernel for double buffering a block (most TTNN ops already support this).
Multi-device
Multi-device can be naturally represented via a combination of two concepts already touched on above: higher ranked grids and `collapseIntervals`. Let's consider the following example with a 3d grid and `collapseIntervals=[(1, -1)]`.
tensor<2x3x64x128xf32,
#tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
undef,
<2x2x4>,
memref<1x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
>
>
Here we've left the batch dim intact and started collapsing at `d1`. This enables us to define a 3d grid where the outermost grid dim divides the batch directly. This could map to a 2 device system where the batch dim is evenly divided between 2 devices. Within each device this op runs on a `2x4` grid.
The high level takeaway here is that how a tensor is logically divided up is decoupled from its mapping to physical compute resources. This has a nice property that data parallel extends to any tensor dimension and is captured under the same grid primitive that also divides tensor rows and columns.
Test Plan
- `test/python/tensor_layout.py`: Assertions for `LayoutAttr` to make sure it's spec compliant.
- Sweep tests:
  - Grid dim sweeps
  - Tilize / untilize sweeps
  - Padding sweeps
- Multi-device tests
Concerns
tt.layout
is deliberately flexible and tries to capture as many problematic use-cases we've ran into in the past in a single, succinct representation. This flexibility will need to be further constrained by backends to avoid unsupported programming of this attribute.- Optimization solution space is potentially large with all of this flexibility.
Two things that I hope can help protect us here:
- By and large the heuristic we'll be following is just max the grid at all costs. This should really narrow down the solution space to only a handful of options and we only keep exploring if producers/consumers end up with nasty reblocking.
- We can constrain the optimizer heuristics as aggressively as possible in the beginning and just advertise the full flexible options to the UI model explorer. Hopefully this enables us to experiment with crazier grid layouts and prove it's worthwhile before writing an algorithm.
TTNN Tensor Layout
The above section of this document covers how the compiler models tensor layout. There are some slight differences in TTNN, but the high level idea of collapsing dims is still used.
Terms
- `shape`: Always the logical shape, n-dimensional.
- `stride`: Same as pytorch stride, but this is crucial for describing how n-dimensional data gets packed into a 2D physical layout. This 2D physical layout is always the inner dim (-1) wide and dims [0, N-1] are collapsed into rows derived from stride.
- `shard_shape`: Also a logical shape, describes a 2d region that chunks `physical_shape`. Note this does not need to be a tile multiple.
- `physical_shard_shape`: The `shard_shape` padded out to `tile_shape`.
- `tile_shape`: A programmable tile shape, though constraints must check that it's compatible with an op's usage, i.e. FPU/Noc compatible.
- `grid_shape`: `[divup(stride[0] // stride[-2], shard_shape[0]), divup(stride[-2], shard_shape[0])]`
Mapping from the compiler
The compiler uses an affine map to explicitly track which dimensions are folded together, but TTNN does not have affine maps, so the representation is a bit more implicit. TTNN captures the dimension collapsing in the `stride` attribute, where dimensions `[0, N-1]` are always collapsed. This is less flexible, so the compiler will have to enforce only collapsing supported dimensions when targeting TTNN, or handle lowering in a different way. For example, in the compiler we might want to represent data parallel over the tensor batch dim by leaving `d0` and collapsing `d1 - d[-1]`. TTNN doesn't support this in its tensor layout representation, but this could be lowered to a TTNN mesh tensor where the mesh could be sliced on the batch and each per-device tensor has `d0` fully collapsed.
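For the common contiguous case, the implicit TTNN collapse is easy to state; a sketch, assuming all but the innermost dim fold into rows:

```python
import math

# Sketch of TTNN's implicit 2D collapse for a contiguous tensor: dims
# [0, N-1] fold into rows, the innermost dim is the row width.
def collapsed_2d(shape):
    return (math.prod(shape[:-1]), shape[-1])

assert collapsed_2d((2, 3, 64, 128)) == (384, 128)
```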
TTNN Example
TTNN Interactive Visualizer
Device
Device in tt-mlir is somewhat of an overloaded term and can refer to different things depending on the context. This document will only speak to the compiler's abstract representation of a device captured by the attribute `#tt.device`.
Terms
There are many overloaded terms when talking about devices and grids, this document will use the following definitions:
- Physical Grid: A 2D array of tensix cores on a chip.
- Chip: A single physical chip with a Physical Grid of cores.
- Card: A PCIE or Ethernet card that may contain multiple Chips.
- System: A collection of Cards that are usually connected together on the same host via PCIE or networked via ethernet. A system is represented by `SystemDesc` in the compiler.
- Device: Device is always presented as a single entity to the enclosing scope, but it may be virtualized to abstract a multi-card System and part of its encoding carries a Logical Grid. Another way to think of a device is as a view over the system.
- Logical Grid or just Grid: Is a logical shape that abstracts one or more Physical Grids.
- Mesh Shape: Describes the virtual layout of the chips with respect to each other. In practice the mesh shape is used to derive the logical grid.
Motivation
The device attribute strives to achieve the following goals:
- Provide a convenient representation of a physical grid that decouples the logical division of tensors from the physical layout of the hardware. This not only simplifies reasoning about how tensors get divided into shards, but can also enable reinterpretations of the device grid for data layout optimization decoupled from the existing encoding of the tensor layouts.
- Following the first point, the device attribute should be able to represent many different forms of logical grids, from simple 2D grids, to more complex topologies like extra-wide grids or higher dimensional grids.
- Device attribute captures encoding both single chip and multi-chip systems under a single, virtualized representation.
- Enable many forms of data parallel execution strategies for single and multi chip systems under a single representation.
Scope
This document will cover how the device attribute is encoded and how it can be lowered to backend dialects. The document will not cover the algorithm for choosing the best, or even legal, device configurations for a given physical system.
Examples
All of the following examples will assume the physical hardware has an 8x8 physical
grid of cores. We will use notation [N, 8x8]
to represent a N
chip system,
each with an 8x8 physical grid.
`#tt.device` in its simplest, single chip form `[1, 8x8]` just maps directly 1-1 to the underlying physical hardware device.
#tt.device<
workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
meshShape = 1,
chipIds = [0]
>
Let's break down what each of these attributes mean:
- `workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>`: This is a 2D logical grid with dim 8x8. It's followed by an affine map `(d0, d1) -> (0, d0, d1)` that provides a mapping from the logical grid to the physical grid. In this case, the logical grid is the same as the physical grid, so the mapping is the identity function. The logical grid can have any rank, but the physical mapping is always 3D, with the first result being the chip index, followed by the 2D physical core index within the chip.
- `meshShape = 1`: A shape provided as part of the `DeviceAttr` constructor that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. an 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
- `chipIds = [0]`: This is a list of chip indices. These chip indices directly reference the same chip indices in the system descriptor. The `SystemDesc` attribute that this is in reference to is tagged on the top level `ModuleOp`.
Specific examples that this document will cover:
- Data Parallel Over Batch
- Data Parallel Over 2d
- Data Parallel Over 2d and Batch
- Pipeline Parallel
- Reinterpreted Grids (Transpose)
- Reinterpreted Grids (Training Usecase)
- Reinterpreted Grids (Extra)
Before we move on to more complex examples, it's worth having on hand:
- The python test `test/python/device_attr.py`, which shows how all of these examples can actually be programmed for the device attribute.
- The Tensor Layout spec, as the following examples will demonstrate how tensor layout interacts with the logical device grid.
Note on Data Parallel: There is existing literature that explicitly distinguishes between data parallel and tensor parallel, oftentimes describing data parallel as duplicating the model across multiple devices and trivially dividing up the batch whereas tensor parallel refers to tensor data being distributed and potentially communicated between devices during execution. While this is true for multi-GPU/CPU systems, it is somewhat of an implementation detail and given the flexibility of tenstorrent hardware there is an opportunity to generalize this concept. In this document we will use the term data parallel to refer to any form of parallelism that divides any dimension of the tensor across multiple cores/chips.
Note on Constraints: Many of the examples below require careful virtualization of the underlying physical system, i.e. some device configurations might only work if the chips are connected via ethernet and with a particular topology, but these constraints are outside the scope of the examples and will be discussed further in the Backend Lowering and Constraints section.
Data Parallel Over Batch
Given a 2 chip system, `[2, 8x8]`, we can represent a simple data parallel logical grid that divides the batch dimension in half across the two chips. This is denoted by `meshShape = 2x1x1`, which means the logical grid is 3D.
#tt.device<
workerGrid = #tt.grid<2x8x8, (d0, d1, d2) -> (d0, d1, d2)>,
meshShape = 2x1x1,
chipIds = [0, 1]
>
The affine map here is just identity, so dims `d1` and `d2` directly index the physical grid and `d0` indexes the chip.
Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:
tensor<16x3x64x128xf32,
#tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
undef,
<2x2x4>,
memref<8x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
>
>
If we map this tensor onto the above device, it will span across both chips, half of the batch dimension on each chip. Within each chip the tensor occupies a 2x4 grid out of the 8x8 physical grid available.
Data Parallel Over 2d
In this example we will consider a 2 chip system, `[2, 8x8]`, and view it as though the two chips are concatenated together side by side to form a single `8x16` grid. This is denoted by `meshShape = 1x2`, which means to concatenate the chips in the second dimension.
#tt.device<
workerGrid = #tt.grid<8x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8)>,
meshShape = 1x2,
chipIds = [0, 1]
>
Here we can see that the affine map encodes an indexing pattern such that when we extend past 8 cores in the second dimension, we wrap around to the next chip.
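Evaluating the map at a few logical coordinates shows the wrap-around explicitly; a sketch:

```python
# The workerGrid map for the 1x2 mesh, returning (chip, phys_row, phys_col).
def worker(d0, d1):
    return ((d0 // 8) * 2 + d1 // 8, d0, d1 % 8)

assert worker(0, 0) == (0, 0, 0)   # stays on chip 0
assert worker(0, 8) == (1, 0, 0)   # column 8 wraps onto chip 1
assert worker(7, 15) == (1, 7, 7)  # bottom-right corner of the 8x16 grid
```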
Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:
tensor<256x1024xf32,
#tt.layout<(d0, d1) -> (d0, d1),
undef,
<4x16>,
memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
>
>
This single tensor maps trivially onto the logical grid, spanning the upper half. Decoupled from the tensor's layout, under the hood the tensor is actually physically spanning across two chips.
Data Parallel Over 2d and Batch
The previous 2 examples can be composed together to form a logical grid that divides a tensor across multiple dimensions. Here we will consider a 4 chip system `[4, 8x8]` and view it as a `2x8x16` grid. Note that the `meshShape` is `2x1x2`, which means to concatenate the chips in the first and third dimensions.
#tt.device<
workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
meshShape = 2x1x2,
chipIds = [0, 1, 2, 3]
>
We can evaluate the affine map to see that the chips are interpreted in chunks of two, where groups `[0, 1]` and `[2, 3]` each form 8x16 grids and these 2 groups concatenate to form a 2x8x16 grid.
We can consider the following tensor to map onto this grid:
tensor<64x256x1024xf32,
#tt.layout<(d0, d1, d2) -> (d0, d1, d2),
undef,
<2x4x16>,
memref<32x2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
>
>
Pipeline Parallel
Pipeline parallel in the scope of this spec isn't particularly interesting; it is intended to be used in conjunction with the `ttir.pipeline` operation, which will group sections of the module's operations into pipeline regions and will be covered in a separate spec.
What we can demonstrate here is how we can take multiple non-overlapping views of the system descriptor to form distinct virtual devices.
Given an 8 chip system `[8, 8x8]`, we can form two virtual devices that each take 4 chips and interpret them differently (though they could take the same logical grid).
#tt.device<
workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
meshShape = 2x1x2,
chipIds = [0, 1, 2, 3]
>
#tt.device<
workerGrid = #tt.grid<16x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0 mod 8, d1 mod 8)>,
meshShape = 2x2,
chipIds = [4, 5, 6, 7]
>
Reinterpreted Grids (Transpose)
One particularly interesting usecase that logical grids could enable is to reinterpret the grid as a form of data layout optimization. For example, if we wanted to transpose a tensor, instead of having to move the data around to implement transpose, we could instead reinterpret the grid as being transposed, leveraging the fact that the relevant data is already located on the correct cores/chips.
To keep things simple, let's consider a 1 chip system `[1, 8x8]`, but it's not too big a leap to see how this could map to multi-chip, where the cost of moving data is even higher.
Let's also consider a simple (totally contrived) eltwise unary graph:
a = exp(a)
aT = transpose(a)
relu(aT)
- We'll establish a regular, single chip, identity logical grid:
#tt.device<
workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
meshShape = 1,
chipIds = [0]
>
- Execute `exp`.
- We'll reinterpret the grid as transposed:
#tt.device<
workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d1, d0)>,
meshShape = 1,
chipIds = [0]
>
- Execute `transpose`. Note that each core only needs to transpose its data locally. Eventually this could be implemented as a no-op by reindexing the tile visitation order of the successive operation.
- Execute `relu`.
It's important to note that we effectively implemented transpose without moving data anywhere.
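The reindexing can be sanity checked by composing the two maps; a sketch:

```python
# Identity and transposed workerGrid maps, returning (chip, y, x).
def identity_grid(d0, d1):
    return (0, d0, d1)

def transposed_grid(d0, d1):
    return (0, d1, d0)

# Logical core (r, c) of the transposed view reads the physical core that
# held logical (c, r) under the identity view, so no data has to move.
assert transposed_grid(2, 5) == identity_grid(5, 2)
```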
Reinterpreted Grids (Extra)
For the sake of examples, here's a few more ways of reinterpreting the logical grid.
Extra Wide Grid
#tt.device<
workerGrid = #tt.grid<1x64, (d0, d1) -> (0, d0 * 8 + d1 floordiv 8, d1 mod 8)>,
meshShape = 1,
chipIds = [0]
>
Extra Tall + Transposed Grid
#tt.device<
workerGrid = #tt.grid<64x1, (d0, d1) -> (0, d1 * 8 + d0 floordiv 8, d0 mod 8)>,
meshShape = 1,
chipIds = [0]
>
Staircase
#tt.device<
workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, (d0 + d1) mod 8)>,
meshShape = 1,
chipIds = [0]
>
This could be an interesting starting position for data in implementing matmul as a systolic array in a ring topology.
Lowering to TTNN
While the above device attribute encoding is quite flexible, this does not necessarily mean the target backend can actually support all of these interpretations. The TTNN backend will be constrained to support only the specialized grid topologies that are supported by the API.
Grid/Shard Orientation
TODO
Multi-device
Please refer to TTNN Mesh Programming Docs for more information on how to program multi-device systems with TTNN API.
Multi-device TTNN dialect will try and stay as close to the TTNN API as possible. Let's consider what this looks like from the compiler and runtime perspectives:
Compiler
- Device Creation: The TTNN device in the compiler is exactly the same attribute from the ttir dialect. It will encode the `meshShape` into the flatbuffer, which can be directly used to program `::ttnn::MeshShape`.
- Tensor Layout: Again, the tensor layout is inherited in the TTNN dialect from the ttir dialect. The grid attribute in the tensor layout can be trivially divided by `meshShape` to determine the shape of the tensor slice on each device. Broadcasting rules can be applied to determine which Distribution Strategy to use:
  - Mesh Sharded: If the tensor grid is > 1 along the `meshShape` dimensions, the tensor will be sharded across the mesh devices.
  - Replication: If the tensor needs to be broadcasted for this op, by extension the tensor layout will be replicated across the mesh devices.
Runtime
- Device Creation: The ttnn runtime will wholesale switch to working with mesh devices via the api `ttnn::multi_device::open_mesh_device`; this is possible because a 1x1 mesh device is a valid single device. The mesh shape during device open will always be `1xN`, where `N` is the number of deviceIds in the array. Note that this shape can be reinterpreted by flatbuffer programs on the fly with the `SubMesh` API.
- Tensor Creation: Tensor creation in a multi-device system is a bit more involved. In order to upload a multi-device tensor to the mesh, the host tensor must first be created with `MultiDeviceHostStorage`. The ttnn runtime can automatically do this during `handleToHostMemoryConfigOp`:
  - A regular host tensor will bounce through a new tensor with `MultiDeviceHostStorage` type.
  - `tensor.to(mesh_device)` will allocate/move the tensor to the mesh device.
Lowering to TTMetal
In the TTMetal dialect we are only constrained by what we've implemented in the tt-mlir compiler; this means it is much more flexible and can theoretically support any of the grid interpretations above.
Test Plan
- `test/python/device_attr.py` covers all of the examples above and asserts the IR is correctly generated.
- Additional functional unit tests will be added as op and runtime support is added.
Concerns
- `tt.device` is very flexible, but with this flexibility comes the potential for misuse. It's important that the compiler is able to validate the legal configurations of this attribute for the target backend.
'tt' Dialect
TT types and attributes common to all TT dialects.
[TOC]
ArchAttr
TT Arch
Syntax:
#tt.arch<
::mlir::tt::Arch # value
>
Enum cases:
- grayskull (`Grayskull`)
- wormhole_b0 (`WormholeB0`)
- blackhole (`Blackhole`)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::Arch | an enum of type Arch |
ArgumentAllocationAttr
Argument allocation attribute in TT dialect
Syntax:
#tt.arg_alloc<
uint64_t, # address
uint64_t, # size
MemorySpace # memorySpace
>
Holds the metadata for the allocation of a function argument, i.e. for graph inputs.
Parameters:
Parameter | C++ type | Description |
---|---|---|
address | uint64_t | |
size | uint64_t | |
memorySpace | MemorySpace |
BufferAccessAttr
TT Buffer Access
Syntax:
#tt.buffer_access<
::mlir::tt::BufferAccess # value
>
Enum cases:
- alias (`Alias`)
- stream (`Stream`)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::BufferAccess | an enum of type BufferAccess |
BufferAttr
Buffer attribute in TT dialect
Syntax:
#tt.buffer<
MemRefType, # memref
BufferAccess # buffer_access
>
Describes the physical footprint and layout of a buffer in L1. Its memref must also have a shape with rank equal to DeviceAttr grid. It also carries a buffer access attribute which can be one of:
- Alias: This buffer aliases a persistent Tensor L1 allocation directly. Implies that no datamovement occurs and the compute kernel just accesses the local allocation directly.
- Stream: This buffer is a temporary destination as a means to get remote data for local computation. Remote data is most likely a tensor that is allocated in dram, but could also be data from a remote core.
Parameters:
Parameter | C++ type | Description |
---|---|---|
memref | MemRefType | A memref that describes the physical footprint and layout of the buffer. It must also have a shape with rank equal to DeviceAttr grid. |
buffer_access | BufferAccess | How data is accessed through this buffer, alias or stream. |
CPUDescAttr
TT cpu_desc attribute
Syntax:
#tt.cpu_desc<
CPURole, # role
StringAttr # target_triple
>
TT cpu_desc attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
role | CPURole | |
target_triple | StringAttr |
CPURoleAttr
TT CPU Role
Syntax:
#tt.cpu_role<
::mlir::tt::CPURole # value
>
Enum cases:
- host (`Host`)
- device (`Device`)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::CPURole | an enum of type CPURole |
ChipChannelAttr
TT chip_channel attribute
Syntax:
#tt.chip_channel<
unsigned, # deviceId0
::llvm::ArrayRef<int64_t>, # ethernetCoreCoord0
unsigned, # deviceId1
::llvm::ArrayRef<int64_t> # ethernetCoreCoord1
>
TT chip_channel attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
deviceId0 | unsigned | |
ethernetCoreCoord0 | ::llvm::ArrayRef<int64_t> | |
deviceId1 | unsigned | |
ethernetCoreCoord1 | ::llvm::ArrayRef<int64_t> |
ChipCoordAttr
TT chip_coord attribute
Syntax:
#tt.chip_coord<
unsigned, # rack
unsigned, # shelf
unsigned, # y
unsigned # x
>
TT chip_coord attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
rack | unsigned | |
shelf | unsigned | |
y | unsigned | |
x | unsigned |
ChipDescAttr
TT chip_desc attribute
Syntax:
#tt.chip_desc<
ArchAttr, # arch
::llvm::ArrayRef<int64_t>, # grid
unsigned, # l1Size
unsigned, # numDramChannels
unsigned, # dramChannelSize
unsigned, # nocL1AddressAlignBytes
unsigned, # pcieAddressAlignBytes
unsigned, # nocDRAMAddressAlignBytes
unsigned, # l1UnreservedBase
unsigned, # eriscL1UnreservedBase
unsigned, # dramUnreservedBase
unsigned, # dramUnreservedEnd
ChipPhysicalCoresAttr, # chipPhysicalCores
::llvm::ArrayRef<DataTypeAttr>, # supportedDataTypes
::llvm::ArrayRef<TileSizeAttr>, # supportedTileSizes
unsigned # numCBs
>
TT chip_desc attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
arch | ArchAttr | |
grid | ::llvm::ArrayRef<int64_t> | |
l1Size | unsigned | |
numDramChannels | unsigned | |
dramChannelSize | unsigned | |
nocL1AddressAlignBytes | unsigned | |
pcieAddressAlignBytes | unsigned | |
nocDRAMAddressAlignBytes | unsigned | |
l1UnreservedBase | unsigned | |
eriscL1UnreservedBase | unsigned | |
dramUnreservedBase | unsigned | |
dramUnreservedEnd | unsigned | |
chipPhysicalCores | ChipPhysicalCoresAttr | |
supportedDataTypes | ::llvm::ArrayRef<DataTypeAttr> | |
supportedTileSizes | ::llvm::ArrayRef<TileSizeAttr> | |
numCBs | unsigned |
ChipPhysicalCoresAttr
TT chip_physical_cores attribute
Syntax:
#tt.chip_physical_cores<
::llvm::ArrayRef<CoreCoordAttr>, # worker
::llvm::ArrayRef<CoreCoordAttr>, # dram
::llvm::ArrayRef<CoreCoordAttr>, # eth
::llvm::ArrayRef<CoreCoordAttr> # eth_inactive
>
TT chip_physical_cores attribute containing arrays of physical cores by core type in order of logical cores.
Parameters:
Parameter | C++ type | Description |
---|---|---|
worker | ::llvm::ArrayRef<CoreCoordAttr> | |
dram | ::llvm::ArrayRef<CoreCoordAttr> | |
eth | ::llvm::ArrayRef<CoreCoordAttr> | |
eth_inactive | ::llvm::ArrayRef<CoreCoordAttr> |
CoreCoordAttr
TT core_coord attribute
Syntax:
#tt.core_coord<
int64_t, # y
int64_t # x
>
TT core_coord attribute containing a single physical core coordinate.
Parameters:
Parameter | C++ type | Description |
---|---|---|
y | int64_t | |
x | int64_t |
DataTypeAttr
TT DataTypes
Syntax:
#tt.supportedDataTypes<
::mlir::tt::DataType # value
>
Enum cases:
- f32 (`Float32`)
- f16 (`Float16`)
- bf16 (`BFloat16`)
- bfp_f8 (`BFP_Float8`)
- bfp_bf8 (`BFP_BFloat8`)
- bfp_f4 (`BFP_Float4`)
- bfp_bf4 (`BFP_BFloat4`)
- bfp_f2 (`BFP_Float2`)
- bfp_bf2 (`BFP_BFloat2`)
- u32 (`UInt32`)
- u16 (`UInt16`)
- u8 (`UInt8`)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::DataType | an enum of type DataType |
DeviceAttr
Device attribute in TT dialect.
Syntax:
#tt.device<
::mlir::tt::GridAttr, # workerGrid
AffineMap, # l1Map
AffineMap, # dramMap
::llvm::ArrayRef<int64_t>, # meshShape
::llvm::ArrayRef<unsigned> # chipIds
>
Describes the physical layout of a device in the system and is made up of a few components:
- A grid attribute that describes the device's compute grid shape. It not only describes the shape of the compute grid, but also carries an affine map that describes how the logical grid maps to the physical grid.
- Two affine maps that describe how a tensor layout's linear attribute maps to the L1 and DRAM memory spaces.
- A mesh shape that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
- An array of chip ids that this device is made up of. This array's length must match the volume of the mesh shape and should be interpreted in row-major order.
Parameters:
Parameter | C++ type | Description |
---|---|---|
workerGrid | ::mlir::tt::GridAttr | TT grid attribute |
l1Map | AffineMap | |
dramMap | AffineMap | |
meshShape | ::llvm::ArrayRef<int64_t> | |
chipIds | ::llvm::ArrayRef<unsigned> |
GridAttr
TT grid attribute
Syntax:
#tt.grid<
::llvm::ArrayRef<int64_t>, # shape
AffineMap # mapping
>
TT grid attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
shape | ::llvm::ArrayRef<int64_t> | |
mapping | AffineMap |
IteratorTypeAttr
TT IteratorType
Syntax:
#tt.iterator_type<
::mlir::tt::IteratorType # value
>
Enum cases:
- parallel (`Parallel`)
- systolic (`Systolic`)
- broadcast (`Broadcast`)
- reduction (`Reduction`)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::IteratorType | an enum of type IteratorType |
LayoutAttr
Tensor layout attribute
Syntax:
#tt.layout<
AffineMap, # linear
OOBVal, # oob_val
GridAttr, # grid
MemRefType, # memref
TensorMemoryLayout # mem_layout
>
The tensor layout attribute captures how tensor data is sharded across a grid of devices, cores, and is laid out in memory.
Some high level goals
- Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
- Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
- Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
- Explicit: A single source of truth.
- Enable a direct way to query padded regions.
Please refer to the Tensor Layout Spec for more in depth documentation.
Examples:
tensor<8x300xf32,
#tt.layout<(d0, d1) -> (d0, d1),
undef,
<1x2>,
memref<8x150xf32, #tt.memory_space<l1>>
>
>
tensor<8x96x32xf32,
#tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
undef,
<2x1>,
memref<384x32xf32, #tt.memory_space<l1>>
>
>
tensor<8x96x32xf32,
#tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
undef,
<2x1x2>,
memref<384x96x16xf32, #tt.memory_space<l1>>
>
>
tensor<5x3x2x2x7x32x32xf32,
#tt.layout<
(d0, d1, d2, d3, d4, d5, d6)
-> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
undef,
<3x2x2x2>,
memref<4480x4x16x16xf32, #tt.memory_space<l1>>
>
>
Parameters:
Parameter | C++ type | Description |
---|---|---|
linear | AffineMap | An affine map that defines how the logical tensor dimensions map to a grid shape. |
oob_val | OOBVal | A tracked out of bounds value that fills padding space. |
grid | GridAttr | The grid shape that this tensor is divided onto. |
memref | MemRefType | A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid. |
mem_layout | TensorMemoryLayout | The layout of the tensor in memory. |
MemorySpaceAttr
TT MemorySpace
Syntax:
#tt.memory_space<
::mlir::tt::MemorySpace # value
>
Enum cases:
- system (`System`)
- mmio (`SystemMMIO`)
- dram (`DeviceDRAM`)
- l1 (`DeviceL1`)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::MemorySpace | an enum of type MemorySpace |
OOBValAttr
TT OOBVal
Syntax:
#tt.oob_val<
::mlir::tt::OOBVal # value
>
Enum cases:
- undef (`Undef`)
- zero (`Zero`)
- one (`One`)
- inf (`Inf`)
- neginf (`NegInf`)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::OOBVal | an enum of type OOBVal |
OperandConstraintAttr
TT Operand Constraints
Syntax:
#tt.operand_constraint<
::mlir::tt::OperandConstraint # value
>
Enum cases:
- system (`System`)
- dram (`DRAM`)
- l1 (`L1`)
- scalar (`Scalar`)
- tile (`Tile`)
- none (`None`)
- interleaved (`Interleaved`)
- single_bank (`SingleBank`)
- height_sharded (`HeightSharded`)
- width_sharded (`WidthSharded`)
- block_sharded (`BlockSharded`)
- any_layout (`AnyLayout`)
- any (`Any`)
- any_device (`AnyDevice`)
- any_device_tile (`AnyDeviceTile`)
- l1_block_sharded (`L1BlockSharded`)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::OperandConstraint | an enum of type OperandConstraint |
SystemDescAttr
TT system_desc attribute
Syntax:
#tt.system_desc<
::llvm::ArrayRef<CPUDescAttr>, # cpuDescs
::llvm::ArrayRef<ChipDescAttr>, # chipDescs
::llvm::ArrayRef<unsigned>, # chipDescIndices
::llvm::ArrayRef<ChipCapabilityAttr>, # chipCapabilities
::llvm::ArrayRef<ChipCoordAttr>, # chipCoords
::llvm::ArrayRef<ChipChannelAttr> # chipChannels
>
TT system_desc attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
cpuDescs | ::llvm::ArrayRef<CPUDescAttr> | |
chipDescs | ::llvm::ArrayRef<ChipDescAttr> | |
chipDescIndices | ::llvm::ArrayRef<unsigned> | |
chipCapabilities | ::llvm::ArrayRef<ChipCapabilityAttr> | |
chipCoords | ::llvm::ArrayRef<ChipCoordAttr> | |
chipChannels | ::llvm::ArrayRef<ChipChannelAttr> |
TensorMemoryLayoutAttr
TT TensorMemoryLayout
Syntax:
#tt.tensor_memory_layout<
::mlir::tt::TensorMemoryLayout # value
>
Enum cases:
- none (`None`)
- interleaved (`Interleaved`)
- single_bank (`SingleBank`)
- height_sharded (`HeightSharded`)
- width_sharded (`WidthSharded`)
- block_sharded (`BlockSharded`)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::TensorMemoryLayout | an enum of type TensorMemoryLayout |
TileSizeAttr
TT tile_size attribute
Syntax:
#tt.tile_size<
int64_t, # y
int64_t # x
>
TT tile_size attribute containing a supported Tensix tile shape.
Parameters:
Parameter | C++ type | Description |
---|---|---|
y | int64_t | |
x | int64_t |
DeviceType
TT device
Syntax:
!tt.device<
::mlir::tt::DeviceAttr # desc
>
Device type in TT dialect
Parameters:
Parameter | C++ type | Description |
---|---|---|
desc | ::mlir::tt::DeviceAttr | Device attribute in TT dialect. |
TileType
TT tile
Syntax:
!tt.tile<
::llvm::ArrayRef<int64_t>, # shape
DataType # dataType
>
Tile type in TT dialect
Parameters:
Parameter | C++ type | Description |
---|---|---|
shape | ::llvm::ArrayRef<int64_t> | |
dataType | DataType |
'ttir' Dialect
TTIR dialect provides high level semantics for dispatching work to TT HW. It defines a set of declarative/high level operations that are used to describe the dispatch, but is largely agnostic to the set of operations or dialects that are actually supported by a consuming backend.
[TOC]
ttir.abs
(tt::ttir::AbsOp)
Eltwise absolute op.
Eltwise absolute operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.add
(tt::ttir::AddOp)
Eltwise add.
Eltwise add operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
, TTIR_GenericRegionOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.all_gather
(tt::ttir::AllGatherOp)
All gather operation.
All gather op.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.alloc
(tt::ttir::AllocOp)
Alloc op.
Tensor Alloc operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace. Enum cases: system (`System`), mmio (`SystemMMIO`), dram (`DeviceDRAM`), l1 (`DeviceL1`). |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.broadcast
(tt::ttir::BroadcastOp)
Broadcast operation.
Broadcast op.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::ArrayAttr | 64-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.cbrt
(tt::ttir::CbrtOp)
Eltwise cubic root op.
Eltwise cubic root operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.ceil
(tt::ttir::CeilOp)
Eltwise ceil op.
Eltwise ceil operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.clamp
(tt::ttir::ClampOp)
Clamp op.
Clamp tensor values to a specified range.
Example: min: 2.000000e+00, max: 5.000000e+00, input: [[0, 1, 2, 3, 4, 5, 6, 7]]
"ttir.clamp"(%arg0) <{min = 2.000000e+00 : f32, max = 5.000000e+00 : f32}> -> %out = [[2, 2, 2, 3, 4, 5, 5, 5]]
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
min | ::mlir::FloatAttr | 32-bit float attribute |
max | ::mlir::FloatAttr | 32-bit float attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.concat
(tt::ttir::ConcatOp)
Concat op.
Concat tensors along a given dimension.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.constant
(tt::ttir::ConstantOp)
Constant op.
Produces tensor filled with given constant value.
Examples:
%0 = "ttir.constant"() {value = dense<0> : tensor<2x3xi32>} : () -> tensor<2x3xi32>
// %0: [[0, 0, 0], [0, 0, 0]]
%1 = "ttir.constant"() {value = dense<[0.2, 1.3]> : tensor<2xf32>} : () -> tensor<2xf32>
// %1: [0.2, 1.3]
Traits: ConstantLike
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
value | ::mlir::ElementsAttr | constant vector/tensor attribute |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.conv2d
(tt::ttir::Conv2dOp)
Conv2d operation.
Applies a 2D convolution over an input image composed of several input planes.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
stride_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
groups | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_left | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_right | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_top | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_bottom | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
weight | ranked tensor of any type values |
bias | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.convolution
(tt::ttir::ConvolutionOp)
Generalized convolution op.
Applies a convolution of the rhs with the lhs.
This operation captures convolutions of all dimensionality as well as deconvolution/conv transpose.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
window_strides | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
padding | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
input_dilation | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
weight_dilation | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
window_reversal | ::mlir::DenseBoolArrayAttr | i1 dense array attribute |
convolution_layout | ::mlir::tt::ttir::ConvolutionLayoutAttr | Structure of dimension information for convolution op. Holds the layout information for the input activation, weights, and output. |
feature_group_count | ::mlir::IntegerAttr | 64-bit signless integer attribute whose value is positive |
batch_group_count | ::mlir::IntegerAttr | 64-bit signless integer attribute whose value is positive |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
weight | ranked tensor of any type values |
bias | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
«unnamed» | ranked tensor of any type values |
ttir.cos
(tt::ttir::CosOp)
Eltwise cosine op.
Eltwise cosine operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.dealloc
(tt::ttir::DeallocOp)
Dealloc op.
Tensor Dealloc operation
Operands:
Operand | Description |
---|---|
result | ranked tensor of any type values |
ttir.div
(tt::ttir::DivOp)
Eltwise divide.
Eltwise divide operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
, TTIR_GenericRegionOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.embedding
(tt::ttir::EmbeddingOp)
Embedding op.
Embedding operation.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
weight | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.eq
(tt::ttir::EqualOp)
Eltwise equal to.
Eltwise equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.exp
(tt::ttir::ExpOp)
Eltwise exponential op.
Eltwise exponential operation. Calculates e^x for all elements x in input tensor.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
, TTIR_GenericRegionOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.expm1
(tt::ttir::Expm1Op)
Eltwise unary op.
Performs an element-wise exponential minus one operation on the operand `tensor` and stores the result in the output tensor.
Example: %a: [[0, 1], [0, 0]] "ttir.expm1"(%a, %out) -> %out: [[0, 1.71828], [0, 0]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.fill
(tt::ttir::FillOp)
Fill operation.
Produces tensor filled with given fill value.
Examples:
%0 = tensor.empty() : () -> tensor<2x3xi32>
%1 = "ttir.fill"(%0) {value = dense<0> : tensor<2x3xi32>} : () -> tensor<2x3xi32>
%2 = tensor.empty() : () -> tensor<2xf32>
%3 = "ttir.fill"(%2) {value = dense<[0.2, 1.3]> : tensor<2xf32>} : () -> tensor<2xf32>
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
value | ::mlir::ElementsAttr | constant vector/tensor attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.floor
(tt::ttir::FloorOp)
Eltwise floor op.
Eltwise floor operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.gather
(tt::ttir::GatherOp)
Gather operation.
Gathers slices from the operand tensor at offsets specified in start_indices and produces a result tensor. From StableHLO Gather Op: https://openxla.org/stablehlo/spec#gather
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
offset_dims | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
collapsed_slice_dims | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
operand_batching_dims | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
start_indices_batching_dims | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
start_index_map | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
index_vector_dim | ::mlir::IntegerAttr | 64-bit signed integer attribute |
slice_sizes | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
indices_are_sorted | ::mlir::BoolAttr | bool attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
start_indices | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.gelu
(tt::ttir::GeluOp)
Eltwise GELU op.
Eltwise GELU operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.generic
(tt::ttir::GenericOp)
Generically dispatch work to a grid of cores.
This generic op carries a region that represents the work each core does. The region is expected to have the same signature as the op itself with respect to input and output operands. The op is expected to be lowered to a backend specific form by a consuming backend. This op is heavily inspired by the linalg.generic op so it can be useful to refer to linalg.generic documentation for more details.
%5 = "ttir.generic"(%1, %2, %3, %4) <{
grid = #tt.grid<1x1>, // The grid range of cores to dispatch work to.
indexing_maps = [#map, #map, #map], // Affine maps for indexing into the input/output tensors. See linalg.generic
iterator_types = [#parallel, #parallel], // Iterator types for the input/output tensors. See linalg.generic
operandSegmentSizes = array<i32: 2, 1, 1>, // Sizes of the operand segments, i.e. 2 inputs, 1 cb and 1 output.
operand_cb_mapping = array<i64: -1, 0, -1>, // Mapping of input & output operands to cbs. -1 means no mapping.
// Mapped operands correspond to buffers in streaming mode.
// Non-mapped operands correspond to buffers in alias mode.
({
^bb0(%arg2: tensor<64x128xf32, #tt.buffer<memref<64x128xf32, #l1_>, alias>>,
%arg3: tensor<64x128xf32, #tt.buffer<memref<64x128xf32, #l1_>, stream>>,
%arg4: tensor<64x128xf32, #tt.buffer<memref<64x128xf32, #l1_>, alias>>):
// Region body, would contain some computation that represents the work each core does.
}) : (tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
grid | ::mlir::tt::GridAttr | TT grid attribute |
indexing_maps | ::mlir::ArrayAttr | AffineMap array attribute |
iterator_types | ::mlir::ArrayAttr | |
operand_constraints | ::mlir::ArrayAttr | |
operand_cb_mapping | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
cbs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.get_dimension_size
(tt::ttir::GetDimensionSizeOp)
GetDimensionSize op.
Produces the size of the given dimension of the operand.
Example:
%operand: [[3, 2, 7], [1, 4, 4]]
"ttir.get_dimension_size"(%operand, %out) {dimension = 1 : i32} -> %out: [[3]]
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
operand | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.ge
(tt::ttir::GreaterEqualOp)
Eltwise greater than or equal to.
Eltwise greater than or equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.gt
(tt::ttir::GreaterThanOp)
Eltwise greater than.
Eltwise greater than operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.index
(tt::ttir::IndexOp)
Index op.
Extract a sub-tensor (slice) from the input tensor along a specified dimension.
The begin, end, and step attributes define the start, stop, and step indices for the selected dimension (dim) of the tensor.
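A hypothetical example, assuming illustrative shapes: with dim = 1, begin = 0, end = 8, step = 2, indexing a tensor<4x8xf32> input selects columns 0, 2, 4, and 6, yielding a tensor<4x4xf32> result.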
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signless integer attribute |
begin | ::mlir::IntegerAttr | 32-bit signless integer attribute |
end | ::mlir::IntegerAttr | 32-bit signless integer attribute |
step | ::mlir::IntegerAttr | 32-bit signless integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.isfinite
(tt::ttir::IsFiniteOp)
Eltwise isfinite op.
Eltwise isfinite operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.kernel
(tt::ttir::KernelOp)
Kernel call.
A generic kernel call operation. This operation is used to pattern match by some consuming backend.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
op | ::mlir::FlatSymbolRefAttr | flat symbol reference attribute |
kind | ::mlir::FlatSymbolRefAttr | flat symbol reference attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values or non-0-ranked.memref of any type values |
outputs | variadic of ranked tensor of any type values or non-0-ranked.memref of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values or non-0-ranked.memref of any type values |
ttir.leaky_relu
(tt::ttir::LeakyReluOp)
Eltwise leaky relu operation.
The Leaky ReLU (Rectified Linear Unit) operation computes an element-wise activation function over its input tensor. It is defined as:
y = x              if x > 0
y = parameter * x  if x <= 0
where parameter is a small, user-defined constant that determines the slope for negative inputs.
Attributes:
- parameter (float): The slope for negative values.
Inputs:
- input (Tensor): The input tensor to be activated.
Outputs:
- output (Tensor): The tensor after applying the Leaky ReLU activation.
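A hypothetical example with parameter = 0.1 (values and call syntax are illustrative):
%input: [[-2.0, 3.0]]
"ttir.leaky_relu"(%input, %out) -> %out: [[-0.2, 3.0]]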
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
parameter | ::mlir::FloatAttr | 32-bit float attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.le
(tt::ttir::LessEqualOp)
Eltwise less than or equal to.
Eltwise less than or equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.lt
(tt::ttir::LessThanOp)
Eltwise less than.
Eltwise less than operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.log1p
(tt::ttir::Log1pOp)
Eltwise log1p operation.
Performs an element-wise log(1 + x) operation on the operand tensor and puts the result in the output tensor.
Example:
%a: [0.0, -0.999, 7.0, 6.38905621, 15.0]
"ttir.log1p"(%a, %out) -> %out: [0.0, -6.90776825, 2.07944155, 2.0, 2.77258873]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.log
(tt::ttir::LogOp)
Eltwise logarithm op.
Eltwise logarithm operation. Calculates log(x) for all elements x in input tensor.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.logical_and
(tt::ttir::LogicalAndOp)
Eltwise logical and.
Eltwise logical and operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.logical_not
(tt::ttir::LogicalNotOp)
Eltwise logical not op.
Eltwise logical not operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.logical_or
(tt::ttir::LogicalOrOp)
Eltwise logical or.
Eltwise logical or operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.logical_xor
(tt::ttir::LogicalXorOp)
Eltwise logical xor.
Eltwise logical xor operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.matmul
(tt::ttir::MatmulOp)
Matrix multiply operation.
Matrix multiply operation.
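A hypothetical shape example (types illustrative): multiplying a = tensor<64x128xf32> by b = tensor<128x96xf32> produces result = tensor<64x96xf32>; the inner dimensions (128) must agree.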
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
a | ranked tensor of any type values |
b | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.max
(tt::ttir::MaxOp)
Max reduction op.
Max reduction op.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_GenericRegionOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.max_pool2d
(tt::ttir::MaxPool2dOp)
Applies a 2D max pooling over an input signal composed of several input planes.
Applies a 2D max pooling over an input signal composed of several input planes.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
kernel_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
kernel_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
ceil_mode | ::mlir::BoolAttr | bool attribute |
padding_left | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_right | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_top | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_bottom | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.maximum
(tt::ttir::MaximumOp)
Eltwise maximum OP.
Calculates maximum of input tensors' values element-wise and stores result in output tensor.
Example:
%lhs: [[3, 2, 7], [1, 4, 4]]
%rhs: [[1, 4, 2], [1, 2, 3]]
"ttir.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.mean
(tt::ttir::MeanOp)
Mean reduction op.
Mean reduction op.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_GenericRegionOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.minimum
(tt::ttir::MinimumOp)
Eltwise minimum OP.
Calculates minimum of input tensors' values element-wise and stores result in output tensor.
Example:
%lhs: [[3, 2, 7], [1, 4, 4]]
%rhs: [[1, 4, 2], [1, 2, 3]]
"ttir.minimum"(%lhs, %rhs, %out) -> %out: [[1, 2, 2], [1, 2, 3]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.multiply
(tt::ttir::MultiplyOp)
Eltwise multiply.
Eltwise multiply operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
, TTIR_GenericRegionOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.neg
(tt::ttir::NegOp)
Eltwise negate op.
Eltwise negate operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.ne
(tt::ttir::NotEqualOp)
Eltwise not equal to.
Eltwise not equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.pooling
(tt::ttir::PoolingOp)
General pooling op
General pooling op
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
pooling_method | ::mlir::tt::ttir::PoolingMethodAttr | TTIR PoolingMethod (enum cases: Average (`Average`), Max (`Max`)) |
window_dimensions | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
window_strides | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
base_dilations | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
window_dilations | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
padding | ::mlir::DenseI64ArrayAttr | i64 dense array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
«unnamed» | variadic of ranked tensor of any type values |
ttir.reciprocal
(tt::ttir::ReciprocalOp)
Eltwise reciprocal.
Eltwise reciprocal operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.relu
(tt::ttir::ReluOp)
Eltwise ReLU.
Eltwise ReLU operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.remainder
(tt::ttir::RemainderOp)
Eltwise remainder.
Performs element-wise remainder of dividend lhs and divisor rhs tensors and produces a result tensor.
Example:
// %lhs: [17, -17, 17, -17]
// %rhs: [3, 3, -3, -3]
%result = "ttir.remainder"(%lhs, %rhs) : (tensor<4xi64>, tensor<4xi64>) -> tensor<4xi64>
// %result: [2, -2, 2, -2]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.reshape
(tt::ttir::ReshapeOp)
Reshape op.
Reshape tensor.
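A hypothetical example (shapes illustrative): with shape = [2, 3], reshaping a tensor<6xf32> input yields a tensor<2x3xf32> result; the total element count (6) is preserved.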
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
shape | ::mlir::ArrayAttr | 32-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.rsqrt
(tt::ttir::RsqrtOp)
Eltwise reciprocal square root.
Eltwise reciprocal square root operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.sigmoid
(tt::ttir::SigmoidOp)
Eltwise sigmoid.
Eltwise sigmoid operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.sign
(tt::ttir::SignOp)
Eltwise sign operation.
Returns the sign of the operand element-wise and produces a result tensor.
Example:
%a: [[3, -2, 0], [1, -4, 4]]
"ttir.sign"(%a, %out) -> %out: [[1, -1, 0], [1, -1, 1]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.sin
(tt::ttir::SinOp)
Eltwise sine.
Eltwise sine operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.slice
(tt::ttir::SliceOp)
Slice op.
Extract a sub-tensor (slice) from the input tensor across one or more dimensions.
The begins, ends, and step attributes specify the start, stop, and step indices for each dimension of the tensor.
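A hypothetical example, assuming illustrative shapes: with begins = [0, 2], ends = [2, 8], step = [1, 2], slicing a tensor<2x8xf32> input keeps rows 0-1 and columns 2, 4, and 6, yielding a tensor<2x3xf32> result.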
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
begins | ::mlir::ArrayAttr | 32-bit integer array attribute |
ends | ::mlir::ArrayAttr | 32-bit integer array attribute |
step | ::mlir::ArrayAttr | 32-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.softmax
(tt::ttir::SoftmaxOp)
Softmax operation.
Softmax operation.
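A hypothetical example (values illustrative): with dimension = 1, applying softmax to %input: [[1.0, 1.0]] yields %out: [[0.5, 0.5]], since softmax exponentiates and normalizes along the given dimension and the two equal logits normalize to equal probabilities.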
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.sqrt
(tt::ttir::SqrtOp)
Eltwise square root.
Eltwise square root operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.squeeze
(tt::ttir::SqueezeOp)
Squeeze op.
Squeeze tensor.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.subtract
(tt::ttir::SubtractOp)
Eltwise subtract.
Eltwise subtract operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.sum
(tt::ttir::SumOp)
Sum reduction op.
Sum reduction op.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_GenericRegionOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.to_layout
(tt::ttir::ToLayoutOp)
Layout op.
ToLayout operation: transitions tensors from one layout to another. Some examples include:
- Transitioning between different memory spaces, e.g. DRAM to L1.
- Transitioning between different data types, e.g. f32 to f16.
- Transitioning between different tile sizes, e.g. 1x16 to 32x32
- Transitioning between different tensor sharding
- Some combination of the above
#layout = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.to_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.transpose
(tt::ttir::TransposeOp)
Transpose op.
Transpose tensor along two given dimensions.
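A hypothetical example (shapes illustrative): with dim0 = 0 and dim1 = 1, transposing a tensor<2x3xf32> input yields a tensor<3x2xf32> result.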
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim0 | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dim1 | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.typecast
(tt::ttir::TypecastOp)
Eltwise cast op.
Eltwise cast operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.unsqueeze
(tt::ttir::UnsqueezeOp)
Unsqueeze op.
Unsqueeze tensor.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.where
(tt::ttir::WhereOp)
Eltwise where op.
Eltwise where operation.
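A hypothetical example in the style of the other eltwise ops (values illustrative, assuming the first input acts as the predicate selecting between the second and third inputs):
%cond: [[1, 0]]
%a: [[10, 20]]
%b: [[30, 40]]
"ttir.where"(%cond, %a, %b, %out) -> %out: [[10, 40]]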
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.yield
(tt::ttir::YieldOp)
Yield op.
Yield operation, this is required by MLIR to mark the end of a dispatch region.
Traits: AlwaysSpeculatableImplTrait
, ReturnLike
, Terminator
Interfaces: ConditionallySpeculatable
, NoMemoryEffect (MemoryEffectOpInterface)
, RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
Operands:
Operand | Description |
---|---|
values | variadic of ranked tensor of any type values or non-0-ranked.memref of any type values |
'ttkernel' Dialect
A TTKernel out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.
EthernetConfigAttr
TT EthernetConfig attribute
Syntax:
#ttkernel.ethernet_config<
EthType, # eth_type
NocIndex # noc_index
>
TT ethernet_config attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
eth_type | EthType | |
noc_index | NocIndex |
NocConfigAttr
TT NocConfig attribute
Syntax:
#ttkernel.noc_config<
NocIndex # noc_index
>
TT noc_config attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
noc_index | NocIndex |
ReduceDimAttr
TTKernel Reduce Dimensions
Syntax:
#ttkernel.reduce_dim<
::mlir::tt::ttkernel::ReduceDim # value
>
Enum cases:
- reduce_dim_row (Row)
- reduce_dim_col (Col)
- reduce_dim_scalar (Scalar)
- reduce_dim_none (None)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::ttkernel::ReduceDim | an enum of type ReduceDim |
ReduceTypeAttr
TTKernel Reduce Types
Syntax:
#ttkernel.reduce_type<
::mlir::tt::ttkernel::ReduceType # value
>
Enum cases:
- reduce_sum (Sum)
- reduce_max (Max)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::ttkernel::ReduceType | an enum of type ReduceType |
TensixConfigAttr
TT TensixConfig attribute
Syntax:
#ttkernel.tensix_config<
MathFidelity, # math_fidelity
bool, # fp32_dest_acc_en
bool, # math_approx_mode
::llvm::ArrayRef<UnpackToDestMode> # unpack_to_dest_mode
>
TT compute_desc attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
math_fidelity | MathFidelity | |
fp32_dest_acc_en | bool | |
math_approx_mode | bool | |
unpack_to_dest_mode | ::llvm::ArrayRef<UnpackToDestMode> |
ThreadTypeAttr
TTKernel ThreadTypes
Syntax:
#ttkernel.thread<
::mlir::tt::ttkernel::ThreadType # value
>
Enum cases:
- noc (Noc)
- tensix (Tensix)
- ethernet (Ethernet)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::ttkernel::ThreadType | an enum of type ThreadType |
ttkernel.add
(tt::ttkernel::AddOp)
Add operation
Add operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
ttkernel.add_tiles_init
(tt::ttkernel::AddTilesInitOp)
Short init function
Must be run before add_tiles.
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
ttkernel.add_tiles
(tt::ttkernel::AddTilesOp)
Add operation
Performs element-wise addition C=A+B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.
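A schematic call sequence for a single-tile add, based on the op descriptions in this section (not verbatim IR; CB names and indices are illustrative):
binary_op_init_common(in0_cb, in1_cb, out_cb)
add_tiles_init(in0_cb, in1_cb)
cb_wait_front(in0_cb, 1); cb_wait_front(in1_cb, 1)
tile_regs_acquire()
add_tiles(in0_cb, in1_cb, 0, 0, 0)   // C = A + B from tile 0 of each CB into DST slot 0
tile_regs_commit()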
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
in0_tile_index | 32-bit signless integer |
in1_tile_index | 32-bit signless integer |
dst_index | 32-bit signless integer |
ttkernel.binary_op_init_common
(tt::ttkernel::BinaryOpInitCommonOp)
Init function for all binary ops
Followed by the specific init required with an opcode (binary_op_specific_init).
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
out_cb | TTKernel cb |
ttkernel.builtin
(tt::ttkernel::BuiltinOp)
Builtin call.
Kernel operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
op | ::mlir::FlatSymbolRefAttr | flat symbol reference attribute |
kind | ::mlir::FlatSymbolRefAttr | flat symbol reference attribute |
Operands:
Operand | Description |
---|---|
args | variadic of non-0-ranked.memref of any type values or TTKernel cb |
ttkernel.cb_pop_front
(tt::ttkernel::CBPopFrontOp)
CBPopFront call.
CBPopFront operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
numPages | 32-bit signless integer |
ttkernel.cb_push_back
(tt::ttkernel::CBPushBackOp)
CBPushBack call.
CBPushBack operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
numPages | 32-bit signless integer |
ttkernel.cb_reserve_back
(tt::ttkernel::CBReserveBackOp)
CBReserveBack call.
CBReserveBack operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
numPages | 32-bit signless integer |
ttkernel.cb_wait_front
(tt::ttkernel::CBWaitFrontOp)
CBWaitFront call.
CBWaitFront operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
numPages | 32-bit signless integer |
ttkernel.reinterpret_cast<volatile tt_l1_ptr uint32_t*>
(tt::ttkernel::CastToL1PtrOp)
CastToL1Ptr
Cast specified addr to L1 pointer.
Interfaces: InferTypeOpInterface
Operands:
Operand | Description |
---|---|
addr | 32-bit signless integer or TTKernel l1 address |
Results:
Result | Description |
---|---|
l1_ptr | TTKernel l1 address pointer |
ttkernel.copy_tile_init
(tt::ttkernel::CopyTileInitOp)
Perform the init for copy tile. This does not reconfigure the unpacker data types.
Must be called before copy_tile.
ttkernel.copy_tile
(tt::ttkernel::CopyTileOp)
Copy tile from specified CB to DST.
Copies a single tile from the specified input CB and writes the result to DST at a specified index. The function uses the unpacker to first unpack into SRC registers, then moves the data into DST registers at the specified index. For in_tile_index to be valid for this call, cb_wait_front(n) must previously have been called to ensure that at least some number n > 0 of tiles are available in the input CB. CB index 0 then references the first tile in the received section of the CB, up to index n-1 (in FIFO order). The DST register buffer must be in the acquired state via a tile_regs_acquire call. This call is blocking and is only available on the compute engine.
Operands:
Operand | Description |
---|---|
cb0 | TTKernel cb |
tile_index_cb | 32-bit signless integer |
tile_index_dst | 32-bit signless integer |
ttkernel.exp_tile_init
(tt::ttkernel::ExpTileInitOp)
Short init function which configures compute unit for execution of exp_tile.
Must be run before exp_tile.
ttkernel.exp_tile
(tt::ttkernel::ExpTileOp)
Exp operation
Performs element-wise computation of exponential on each element of a tile in DST register at index tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.
Operands:
Operand | Description |
---|---|
tile_index | 32-bit signless integer |
ttkernel.get_noc_addr
(tt::ttkernel::GetNocAddrOp)
GetNocAddr
GetNocAddr
Interfaces: InferTypeOpInterface
Operands:
Operand | Description |
---|---|
l1Address | 32-bit signless integer |
Results:
Result | Description |
---|---|
nocAddr | TTKernel noc address |
ttkernel.get_noc_addr_xy
(tt::ttkernel::GetNocAddrXYOp)
GetNocAddrXY
GetNocAddr api including core coordinates
Interfaces: InferTypeOpInterface
Operands:
Operand | Description |
---|---|
x | 32-bit signless integer |
y | 32-bit signless integer |
l1Address | 32-bit signless integer |
Results:
Result | Description |
---|---|
nocAddr | TTKernel noc address |
ttkernel.get_write_ptr
(tt::ttkernel::GetWritePtrOp)
GetWritePtr
GetWritePtr operation
Interfaces: InferTypeOpInterface
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
Results:
Result | Description |
---|---|
writePtr | 32-bit signless integer |
ttkernel.matmul
(tt::ttkernel::MatmulOp)
Matmul operation
Matmul operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
ttkernel.mem_zeros_base
(tt::ttkernel::MemZerosBaseOp)
Op corresponding to MEM_ZEROS_BASE macro in kernels.
Op corresponding to MEM_ZEROS_BASE macro in kernels.
Interfaces: InferTypeOpInterface
Results:
Result | Description |
---|---|
result | 32-bit signless integer |
ttkernel.mem_zeros_size
(tt::ttkernel::MemZerosSizeOp)
Op corresponding to MEM_ZEROS_SIZE macro in kernels.
Op corresponding to MEM_ZEROS_SIZE macro in kernels.
Interfaces: InferTypeOpInterface
Results:
Result | Description |
---|---|
result | 32-bit signless integer |
ttkernel.mul
(tt::ttkernel::MulOp)
Mul operation
Mul operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
ttkernel.mul_tiles_init_f
(tt::ttkernel::MulTilesInitFOp)
Short init function. Init for math only.
Must be run before mul_tiles.
ttkernel.mul_tiles_init
(tt::ttkernel::MulTilesInitOp)
Short init function
Must be run before mul_tiles.
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
ttkernel.mul_tiles
(tt::ttkernel::MulTilesOp)
Mul operation
Performs element-wise multiplication C=A*B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
in0_tile_index | 32-bit signless integer |
in1_tile_index | 32-bit signless integer |
dst_index | 32-bit signless integer |
ttkernel.noc_async_read_barrier
(tt::ttkernel::NocAsyncReadBarrierOp)
NocAsyncReadBarrier
NocAsyncReadBarrier
ttkernel.noc_async_read_one_packet_set_state
(tt::ttkernel::NocAsyncReadOnePacketSetStateOp)
NocAsyncReadOnePacketSetState
NocAsyncReadOnePacketSetState
Operands:
Operand | Description |
---|---|
srcNocAddr | TTKernel noc address |
size | 32-bit signless integer |
ttkernel.noc_async_read_one_packet_with_state
(tt::ttkernel::NocAsyncReadOnePacketWithStateOp)
NocAsyncReadOnePacketWithState
NocAsyncReadOnePacketWithState
Operands:
Operand | Description |
---|---|
srcNocAddr | TTKernel noc address |
dstLocalL1Addr | 32-bit signless integer or TTKernel l1 address |
ttkernel.noc_async_read
(tt::ttkernel::NocAsyncReadOp)
NocAsyncRead
NocAsyncRead
Operands:
Operand | Description |
---|---|
srcNocAddr | TTKernel noc address |
dstLocalL1Addr | 32-bit signless integer |
size | 32-bit signless integer |
ttkernel.noc_async_write_barrier
(tt::ttkernel::NocAsyncWriteBarrierOp)
NocAsyncWriteBarrier
NocAsyncWriteBarrier
ttkernel.noc_async_write
(tt::ttkernel::NocAsyncWriteOp)
NocAsyncWrite
NocAsyncWrite
Operands:
Operand | Description |
---|---|
srcLocalL1Addr | 32-bit signless integer |
dstNocAddr | TTKernel noc address |
size | 32-bit signless integer |
ttkernel.pack
(tt::ttkernel::PackOp)
Pack op.
Pack operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
out_cb | TTKernel cb |
out_index | 32-bit signless integer |
ttkernel.pack_set_data_type
(tt::ttkernel::PackSetDataTypeOp)
Pack set DataType op.
Pack set DataType operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
data_type | ::mlir::IntegerAttr | TT DataTypes (enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`)) |
ttkernel.pack_tile
(tt::ttkernel::PackTileOp)
PackTile op.
Copies a single tile from the DST register buffer at a specified index to a specified CB at a given index. For the out_tile_index to be valid for this call, cb_reserve_back(n) has to be called first to reserve at least some number n > 0 of tiles in the output CB. out_tile_index = 0 then references the first tile in the reserved section of the CB, up to index n - 1, which will then be visible to the consumer in the same order after a cb_push_back call. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine.
Each subsequent pack call will increment the write pointer in the cb by single tile size. The pointer is then again set to a valid position with space for n reserved tiles by another cb_reserve_back call.
Operates in tandem with functions cb_reserve_back and cb_push_back.
A typical use case: the producer first reserves space for n tiles in the output CB via cb_reserve_back, then uses pack_tile to copy a tile from one of the DST slots into the reserved space, and finally calls cb_push_back to make the reserved section of the circular buffer visible to the consumer.
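A schematic producer sequence using the ops described here (not verbatim IR; counts and indices are illustrative):
cb_reserve_back(out_cb, 1)   // reserve space for one tile
tile_regs_wait()             // PACK thread waits for the MATH thread to commit DST
pack_tile(0, out_cb, 0)      // copy DST slot 0 into the reserved slot
tile_regs_release()
cb_push_back(out_cb, 1)      // make the tile visible to the consumer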
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
out_cb | TTKernel cb |
out_index | 32-bit signless integer |
ttkernel.recip_tile_init
(tt::ttkernel::RecipTileInitOp)
Init function for recip_tile operation. Refer to documentation for any init function.
Must be called before recip_tile function.
ttkernel.recip_tile
(tt::ttkernel::RecipTileOp)
Recip tile in the DST at specified index.
Performs element-wise computation of the reciprocal on each element of a tile in DST register at index tile_index. The DST register buffer must be in acquired state via tile_regs_acquire call. This call is blocking and is only available on the compute engine. Only works for Float32, Float16_b, Bfp8_b data formats for full accuracy.
Operands:
Operand | Description |
---|---|
tile_index | 32-bit signless integer |
ttkernel.reduce_init
(tt::ttkernel::ReduceInitOp)
Init function
Must be run before reduce_tile.
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
reduce_type | ::mlir::tt::ttkernel::ReduceTypeAttr | TTKernel Reduce Types (enum cases: reduce_sum (`Sum`), reduce_max (`Max`)) |
reduce_dim | ::mlir::tt::ttkernel::ReduceDimAttr | TTKernel Reduce Dimensions (enum cases: reduce_dim_row (`Row`), reduce_dim_col (`Col`), reduce_dim_scalar (`Scalar`), reduce_dim_none (`None`)) |
Operands:
Operand | Description |
---|---|
in_cb | TTKernel cb |
scaling_cb | TTKernel cb |
out_cb | TTKernel cb |
ttkernel.reduce_tile
(tt::ttkernel::ReduceTileOp)
Reduce operation
Performs a reduction operation B = reduce(A) using reduce_func for dimension reduction on a tile in the CB at a given index and writes the result to the DST register at index dst_tile_index. The reduction can be of type Reduce::R, Reduce::C, or Reduce::RC, identifying the dimension(s) to be reduced in size to 1. The DST register buffer must be in the acquired state via a tile_regs_acquire call. The template takes reduce_type, which can be ReduceFunc::Sum or ReduceFunc::Max, and reduce_dim, which can be Reduce::R, Reduce::C, or Reduce::RC; they can also be specified by the defines REDUCE_OP and REDUCE_DIM. This call is blocking and is only available on the compute engine.
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
reduce_type | ::mlir::tt::ttkernel::ReduceTypeAttr | TTKernel Reduce Types (enum cases: reduce_sum (`Sum`), reduce_max (`Max`)) |
reduce_dim | ::mlir::tt::ttkernel::ReduceDimAttr | TTKernel Reduce Dimensions (enum cases: reduce_dim_row (`Row`), reduce_dim_col (`Col`), reduce_dim_scalar (`Scalar`), reduce_dim_none (`None`)) |
Operands:
Operand | Description |
---|---|
in_cb | TTKernel cb |
scaling_cb | TTKernel cb |
in_tile_index | 32-bit signless integer |
scaling_tile_index | 32-bit signless integer |
dst_index | 32-bit signless integer |
ttkernel.return
(tt::ttkernel::ReturnOp)
Return op.
Return operation
Traits: AlwaysSpeculatableImplTrait
, ReturnLike
, Terminator
Interfaces: ConditionallySpeculatable
, NoMemoryEffect (MemoryEffectOpInterface)
, RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
ttkernel.store_to_l1
(tt::ttkernel::StoreToL1Op)
StoreToL1
Store value to L1.
Operands:
Operand | Description |
---|---|
value | 32-bit signless integer |
l1_ptr | TTKernel l1 address pointer |
offset | 32-bit signless integer |
ttkernel.sub
(tt::ttkernel::SubOp)
Sub operation
Sub operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
ttkernel.tile_regs_acquire
(tt::ttkernel::TileRegsAcquireOp)
Tile_regs_acquire
Acquire an exclusive lock on the DST register for the MATH thread. This register is an array of 16 tiles of 32x32 elements each. This is a blocking function, i.e. this function will wait until the lock is acquired.
ttkernel.tile_regs_commit
(tt::ttkernel::TileRegsCommitOp)
Tile_regs_commit
Release lock on DST register by MATH thread. The lock had to be previously acquired with tile_regs_acquire.
ttkernel.tile_regs_release
(tt::ttkernel::TileRegsReleaseOp)
Tile_regs_release
Release lock on DST register by PACK thread. The lock had to be previously acquired with tile_regs_wait.
ttkernel.tile_regs_wait
(tt::ttkernel::TileRegsWaitOp)
Tile_regs_wait
Acquire an exclusive lock on the DST register for the PACK thread. It waits for the MATH thread to commit the DST register. This is a blocking function, i.e. this function will wait until the lock is acquired.
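Taken together, the four tile_regs_* ops form a two-phase handshake over the DST register (a schematic ordering based on the descriptions above): the MATH thread runs tile_regs_acquire, computes into DST, then tile_regs_commit; the PACK thread runs tile_regs_wait, packs tiles out of DST (e.g. via pack_tile), then tile_regs_release.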
ttkernel.tilize_block
(tt::ttkernel::TilizeBlockOp)
TilizeBlockOp call.
TilizeBlockOp operation
Operands:
Operand | Description |
---|---|
cbIn | TTKernel cb |
numTiles | 32-bit signless integer |
cbOut | TTKernel cb |
ttkernel.tilize_init
(tt::ttkernel::TilizeInitOp)
TilizeInitOp call.
TilizeInitOp operation
Operands:
Operand | Description |
---|---|
cbIn | TTKernel cb |
numTiles | 32-bit signless integer |
cbOut | TTKernel cb |
ttkernel.unary_op_init_common
(tt::ttkernel::UnaryOpInitCommonOp)
Initialization function for unary operations.
This operation initializes all necessary components for unary operations, including unpacking, packing, and math configurations.
Operands:
Operand | Description |
---|---|
icb | TTKernel cb |
ocb | TTKernel cb |
ttkernel.unpack_ab
(tt::ttkernel::UnpackABOp)
UnpackAB op.
UnpackAB operation
Operands:
Operand | Description |
---|---|
cb_a | TTKernel cb |
src_a_index | 32-bit signless integer |
cb_b | TTKernel cb |
src_b_index | 32-bit signless integer |
ttkernel.unpack_a
(tt::ttkernel::UnpackAOp)
UnpackA op.
UnpackA operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
src_index | 32-bit signless integer |
ttkernel.unpack_set_data_type
(tt::ttkernel::UnpackSetDataTypeOp)
Unpack set DataType op.
Unpack set DataType operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
data_type_a | ::mlir::IntegerAttr | TT DataTypes (enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`)) |
data_type_b | ::mlir::IntegerAttr | TT DataTypes (enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`)) |
ttkernel.unreachable
(tt::ttkernel::UnreachableOp)
Unreachable op.
Unreachable operation
Traits: AlwaysSpeculatableImplTrait
, ReturnLike
, Terminator
Interfaces: ConditionallySpeculatable
, NoMemoryEffect (MemoryEffectOpInterface)
, RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
ttkernel.untilize_block
(tt::ttkernel::UntilizeBlockOp)
UntilizeBlockOp call.
UntilizeBlockOp operation
Operands:
Operand | Description |
---|---|
cbIn | TTKernel cb |
numTiles | 32-bit signless integer |
cbOut | TTKernel cb |
ttkernel.untilize_init
(tt::ttkernel::UntilizeInitOp)
UntilizeInitOp call.
UntilizeInitOp operation
Operands:
Operand | Description |
---|---|
cbIn | TTKernel cb |
cbOut | TTKernel cb |
CBType
TTKernel cb
Syntax:
!ttkernel.cb<
CBPort, # port
uint64_t, # address
MemRefType, # memref
uint64_t, # page_size
uint64_t # num_buffers
>
Circular buffer type in TTKernel dialect
Parameters:
Parameter | C++ type | Description |
---|---|---|
port | CBPort | |
address | uint64_t | |
memref | MemRefType | |
page_size | uint64_t | |
num_buffers | uint64_t |
L1AddrType
TTKernel l1 address
Syntax: !ttkernel.l1_addr
L1 address type in TTKernel dialect
L1AddrPtrType
TTKernel l1 address pointer
Syntax: !ttkernel.l1_addr_ptr
L1 pointer address type in TTKernel dialect
NocAddrType
TTKernel noc address
Syntax: !ttkernel.noc_addr
Noc address type in TTKernel dialect
'ttmetal' Dialect
A TTMetal out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.
CoreRangeAttr
TTMetal grid attribute
Syntax:
#ttmetal.core_range<
::llvm::ArrayRef<int64_t>, # offset
::llvm::ArrayRef<int64_t> # size
>
TTMetal grid attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
offset | ::llvm::ArrayRef<int64_t> | |
size | ::llvm::ArrayRef<int64_t> |
ttmetal.alloc
(tt::ttmetal::AllocOp)
Alloc op.
Tensor Alloc operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace (enum cases: system (`System`), mmio (`SystemMMIO`), dram (`DeviceDRAM`), l1 (`DeviceL1`)) |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttmetal.dealloc
(tt::ttmetal::DeallocOp)
Dealloc op.
Tensor Dealloc operation
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
ttmetal.dispatch
(tt::ttmetal::DispatchOp)
Dispatch op.
Dispatch operation
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
core_ranges | ::mlir::ArrayAttr | |
kernelConfigs | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttmetal.host_read
(tt::ttmetal::HostReadOp)
Host read op.
Host read operation
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttmetal.host_write
(tt::ttmetal::HostWriteOp)
Host write op.
Host write operation
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
value | ::mlir::ElementsAttr | constant vector/tensor attribute |
Operands:
Operand | Description |
---|---|
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
'ttnn' Dialect
A TTNN out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.
ttnn.leaky_relu
(tt::ttnn::LeakyReluOp)
Eltwise leaky relu operation.
The Leaky ReLU (Rectified Linear Unit) operation computes an element-wise activation function over its input tensor. It is defined as:
y = x              if x > 0
y = parameter * x  if x <= 0
where parameter is a small, user-defined constant that determines the slope for negative inputs.
Attributes:
- parameter (float): The slope for negative values.
Inputs:
- input (Tensor): The input tensor to be activated.
Outputs:
- output (Tensor): The tensor after applying the Leaky ReLU activation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
parameter | ::mlir::FloatAttr | 32-bit float attribute |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
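A hedged sketch of the op in generic MLIR form, making the destination-passing style visible: the first operand is the input, the second a pre-allocated output tensor. The shapes and the `operandSegmentSizes` spelling are illustrative assumptions:

```mlir
// Illustrative only: slope 0.01; the segment sizes split operands into inputs/outputs.
%result = "ttnn.leaky_relu"(%input, %dest) <{
    parameter = 1.000000e-02 : f32,
    operandSegmentSizes = array<i32: 1, 1>
  }> : (tensor<32x32xbf16>, tensor<32x32xbf16>) -> tensor<32x32xbf16>
```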
ttnn.abs
(tt::ttnn::AbsOp)
Eltwise absolute.
Eltwise absolute operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.add
(tt::ttnn::AddOp)
Eltwise add.
Eltwise add operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.all_gather
(tt::ttnn::AllGatherOp)
All gather op.
Tensor All Gather operation
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
num_links | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
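A sketch in generic form; the shapes and attribute values are assumptions (here, a 4-device gather along dim 1):

```mlir
// Illustrative only: gather along dim 1 across 4 devices over a single link.
%gathered = "ttnn.all_gather"(%input) <{dim = 1 : si32, num_links = 1 : si32}>
    : (tensor<1x1x32x32xbf16>) -> tensor<1x4x32x32xbf16>
```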
ttnn.alloc
(tt::ttnn::AllocOp)
Alloc op.
Tensor Alloc operation
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
buffer_type | ::mlir::tt::ttnn::BufferTypeAttr | TTNN Buffer Type. Enum cases: dram (`DRAM`), l1 (`L1`), system_memory (`SystemMemory`), l1_small (`L1Small`), trace (`Trace`) |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.cbrt
(tt::ttnn::CbrtOp)
Eltwise cubic root.
Eltwise cubic root operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.ceil
(tt::ttnn::CeilOp)
Eltwise ceil.
Eltwise ceil operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.clamp
(tt::ttnn::ClampOp)
Clamp op.
Clamp tensor values to a specified range.
Example: with min = 2.000000e+00 and max = 5.000000e+00, input [[0, 1, 2, 3, 4, 5, 6, 7]] clamps to [[2, 2, 2, 3, 4, 5, 5, 5]]:

"ttnn.clamp"(%arg0) <{min = 2.000000e+00 : f32, max = 5.000000e+00 : f32}> -> %out = [[2, 2, 2, 3, 4, 5, 5, 5]]
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
min | ::mlir::FloatAttr | 32-bit float attribute |
max | ::mlir::FloatAttr | 32-bit float attribute |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | variadic of ranked tensor of any type values |
ttnn.concat
(tt::ttnn::ConcatOp)
Concat op.
Concat tensors along a given dimension.
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
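A sketch with assumed shapes: two 32x32 tensors concatenated along dim 1 into a pre-allocated 32x64 destination:

```mlir
// Illustrative only: the variadic inputs come first, then the destination tensor.
%result = "ttnn.concat"(%a, %b, %dest) <{dim = 1 : si32}>
    : (tensor<32x32xbf16>, tensor<32x32xbf16>, tensor<32x64xbf16>) -> tensor<32x64xbf16>
```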
ttnn.conv2d
(tt::ttnn::Conv2dOp)
Conv2d operation.
Applies a 2D convolution over an input image composed of several input planes.
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
in_channels | ::mlir::IntegerAttr | 32-bit signless integer attribute |
out_channels | ::mlir::IntegerAttr | 32-bit signless integer attribute |
batch_size | ::mlir::IntegerAttr | 32-bit signless integer attribute |
input_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
input_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
kernel_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
kernel_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
stride_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
stride_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
padding_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
padding_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
dilation_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
dilation_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
groups | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
weight | ranked tensor of any type values |
bias | ranked tensor of any type values |
output | ranked tensor of any type values |
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
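A hedged sketch of a conv2d call. All attribute values and tensor shapes here are illustrative assumptions (a 3x3, stride-1, padding-1 convolution on a 1x32x32x64 input), and the `!tt.device` operand type is shown without its parameters:

```mlir
// Illustrative only; the device type's parameters are elided as an assumption.
%result = "ttnn.conv2d"(%input, %weight, %bias, %dest, %device) <{
    in_channels = 64 : i32, out_channels = 64 : i32, batch_size = 1 : i32,
    input_height = 32 : i32, input_width = 32 : i32,
    kernel_height = 3 : i32, kernel_width = 3 : i32,
    stride_height = 1 : i32, stride_width = 1 : i32,
    padding_height = 1 : i32, padding_width = 1 : i32,
    dilation_height = 1 : i32, dilation_width = 1 : i32,
    groups = 1 : i32
  }> : (tensor<1x32x32x64xbf16>, tensor<64x64x3x3xbf16>, tensor<1x1x1x64xbf16>,
        tensor<1x32x32x64xbf16>, !tt.device) -> tensor<1x32x32x64xbf16>
```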
ttnn.cos
(tt::ttnn::CosOp)
Eltwise cosine.
Eltwise cosine operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.deallocate
(tt::ttnn::DeallocateOp)
Deallocate op.
Tensor Deallocate operation
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
force | ::mlir::BoolAttr | bool attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
ttnn.div
(tt::ttnn::DivOp)
Eltwise divide.
Eltwise divide operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.embedding
(tt::ttnn::EmbeddingOp)
Embedding op.
Embedding operation.
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
weight | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.empty
(tt::ttnn::EmptyOp)
Empty op.
Tensor empty operation
Interfaces: NoMemoryEffect (MemoryEffectOpInterface), TTNN_OpModelInterface
Effects: MemoryEffects::Effect{}
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
shape | ::mlir::tt::ttnn::ShapeAttr | TTNN Shape attribute |
dtype | ::mlir::tt::DataTypeAttr | TT DataTypes. Enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`) |
layout | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout. Enum cases: row_major (`RowMajor`), tile (`Tile`), invalid (`Invalid`) |
memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |
Operands:
Operand | Description |
---|---|
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.eq
(tt::ttnn::EqualOp)
Eltwise equal to.
Eltwise equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.exp
(tt::ttnn::ExpOp)
Eltwise exponential.
Eltwise exponential operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.expm1
(tt::ttnn::Expm1Op)
Eltwise expm1 op.
Performs an element-wise exponential-minus-one operation on the operand tensor and stores the result in the output tensor.

Example:
%a: [[0, 1], [0, 0]]
"ttnn.expm1"(%a, %out) -> %out: [[0, 1.71828], [0, 0]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.floor
(tt::ttnn::FloorOp)
Eltwise floor op.
Eltwise floor operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.from_device
(tt::ttnn::FromDeviceOp)
FromDevice op.
This op retrieves the input tensor from the given device.
Interfaces: TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.full
(tt::ttnn::FullOp)
Full op.
Tensor full operation
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
fillValue | ::mlir::FloatAttr | 32-bit float attribute |
Operands:
Operand | Description |
---|---|
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.gelu
(tt::ttnn::GeluOp)
Eltwise GELU.
Eltwise GELU operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.get_device
(tt::ttnn::GetDeviceOp)
Get Device op.
This op returns the current runtime device.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
mesh_shape | ::mlir::tt::ttnn::MeshShapeAttr | TTNN Mesh Shape |
Results:
Result | Description |
---|---|
device | TT device |
ttnn.ge
(tt::ttnn::GreaterEqualOp)
Eltwise greater than or equal to.
Eltwise greater than or equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.gt
(tt::ttnn::GreaterThanOp)
Eltwise greater than.
Eltwise greater than operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.isfinite
(tt::ttnn::IsFiniteOp)
Eltwise isfinite op.
Eltwise isfinite operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.le
(tt::ttnn::LessEqualOp)
Eltwise less than or equal to.
Eltwise less than or equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.lt
(tt::ttnn::LessThanOp)
Eltwise less than.
Eltwise less than operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.log1p
(tt::ttnn::Log1pOp)
Eltwise log1p operation.
Performs an element-wise natural-logarithm-of-one-plus-x operation on the operand tensor and puts the result in the output tensor.

Example:
%a: [0.0, -0.999, 7.0, 6.38905621, 15.0]
"ttnn.log1p"(%a, %out) -> %out: [0.0, -6.90776825, 2.07944155, 2.0, 2.77258873]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.log
(tt::ttnn::LogOp)
Eltwise logarithm.
Eltwise logarithm operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.logical_and
(tt::ttnn::LogicalAndOp)
Eltwise logical and.
Eltwise logical and operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.logical_not
(tt::ttnn::LogicalNotOp)
Eltwise logical not op.
Eltwise logical not operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.logical_or
(tt::ttnn::LogicalOrOp)
Eltwise logical or.
Eltwise logical or operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.logical_xor
(tt::ttnn::LogicalXorOp)
Eltwise logical xor.
Eltwise logical xor operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.matmul
(tt::ttnn::MatmulOp)
Matmul op.
Matrix multiplication operation: computes the product of tensors `a` and `b`, writing into the destination `output`.
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
a | ranked tensor of any type values |
b | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
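A sketch with assumed shapes, showing the destination-passing convention (`output` is pre-allocated and the result aliases it):

```mlir
// Illustrative only: (32x64) x (64x128) -> 32x128.
%result = "ttnn.matmul"(%a, %b, %dest)
    : (tensor<32x64xbf16>, tensor<64x128xbf16>, tensor<32x128xbf16>) -> tensor<32x128xbf16>
```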
ttnn.max
(tt::ttnn::MaxOp)
Max reduction op.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.max_pool2d
(tt::ttnn::MaxPool2dOp)
Applies a 2D max pooling over an input signal composed of several input planes.
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
batch_size | ::mlir::IntegerAttr | 32-bit signed integer attribute |
input_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
input_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
channels | ::mlir::IntegerAttr | 32-bit signed integer attribute |
kernel_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
kernel_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
ceil_mode | ::mlir::BoolAttr | bool attribute |
padding_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.maximum
(tt::ttnn::MaximumOp)
Eltwise maximum op.
Calculates the maximum of the input tensors' values element-wise and stores the result in the output tensor.

Example:
%lhs: [[3, 2, 7], [1, 4, 4]]
%rhs: [[1, 4, 2], [1, 2, 3]]
"ttnn.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.mean
(tt::ttnn::MeanOp)
Mean reduction op.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.minimum
(tt::ttnn::MinimumOp)
Eltwise minimum op.
Calculates the minimum of the input tensors' values element-wise and stores the result in the output tensor.

Example:
%lhs: [[3, 2, 7], [1, 4, 4]]
%rhs: [[1, 4, 2], [1, 2, 3]]
"ttnn.minimum"(%lhs, %rhs, %out) -> %out: [[1, 2, 2], [1, 2, 3]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.multiply
(tt::ttnn::MultiplyOp)
Eltwise multiply.
Eltwise multiply operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.neg
(tt::ttnn::NegOp)
Eltwise negate.
Eltwise negate operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.ne
(tt::ttnn::NotEqualOp)
Eltwise not equal to.
Eltwise not equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.reciprocal
(tt::ttnn::ReciprocalOp)
Eltwise reciprocal.
Eltwise reciprocal operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.reduce_scatter
(tt::ttnn::ReduceScatterOp)
Reduce scatter op.
Tensor Reduce Scatter operation
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
scatter_split_dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
math_op | ::mlir::IntegerAttr | TTNN Reduce Operation Type. Enum cases: sum (`Sum`), mean (`Mean`), max (`Max`), min (`Min`), std (`Std`), var (`Var`) |
num_links | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.relu
(tt::ttnn::ReluOp)
Eltwise ReLU.
Eltwise ReLU operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, OpModel, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.remainder
(tt::ttnn::RemainderOp)
Eltwise remainder.
Performs element-wise remainder of dividend lhs and divisor rhs tensors and produces a result tensor.
Example:
// %lhs: [17, -17, 17, -17]
// %rhs: [3, 3, -3, -3]
%result = "ttnn.remainder"(%lhs, %rhs) : (tensor<4xi64>, tensor<4xi64>) -> tensor<4xi64>
// %result: [2, -2, 2, -2]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.reshape
(tt::ttnn::ReshapeOp)
Reshape op.
Reshape tensor.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
shape | ::mlir::ArrayAttr | 32-bit integer array attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
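A sketch with assumed shapes (flattening a 32x32 tensor to 1x1024):

```mlir
// Illustrative only: the `shape` array gives the new dimensions.
%result = "ttnn.reshape"(%input) <{shape = [1 : i32, 1024 : i32]}>
    : (tensor<32x32xbf16>) -> tensor<1x1024xbf16>
```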
ttnn.rsqrt
(tt::ttnn::RsqrtOp)
Eltwise rsqrt.
Eltwise rsqrt operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.sigmoid
(tt::ttnn::SigmoidOp)
Eltwise sigmoid.
Eltwise sigmoid operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.sign
(tt::ttnn::SignOp)
Eltwise sign operation.
Returns the sign of the operand element-wise and produces a result tensor.

Example:
%a: [[3, -2, 0], [1, -4, 4]]
"ttnn.sign"(%a, %out) -> %out: [[1, -1, 0], [1, -1, 1]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.sin
(tt::ttnn::SinOp)
Eltwise sine.
Eltwise sine operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.slice
(tt::ttnn::SliceOp)
Slice op.
Extract a portion of a tensor based on the specified start (`begins`), stop (`ends`), and `step` indices for each dimension.
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
begins | ::mlir::ArrayAttr | 32-bit integer array attribute |
ends | ::mlir::ArrayAttr | 32-bit integer array attribute |
step | ::mlir::ArrayAttr | 32-bit integer array attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
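A sketch with assumed shapes: rows 0..32 and columns 0..64 of a 64x128 tensor, unit step in both dimensions:

```mlir
// Illustrative only: one begins/ends/step entry per input dimension.
%result = "ttnn.slice"(%input, %dest) <{
    begins = [0 : i32, 0 : i32],
    ends = [32 : i32, 64 : i32],
    step = [1 : i32, 1 : i32]
  }> : (tensor<64x128xbf16>, tensor<32x64xbf16>) -> tensor<32x64xbf16>
```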
ttnn.softmax
(tt::ttnn::SoftmaxOp)
Softmax op.
Softmax operation.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
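A sketch with an assumed shape, normalizing along dim 1:

```mlir
// Illustrative only: softmax over the last dimension of a 32x32 tensor.
%result = "ttnn.softmax"(%input) <{dimension = 1 : si32}>
    : (tensor<32x32xbf16>) -> tensor<32x32xbf16>
```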
ttnn.sqrt
(tt::ttnn::SqrtOp)
Eltwise sqrt.
Eltwise sqrt operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.subtract
(tt::ttnn::SubtractOp)
Eltwise subtract.
Eltwise subtract operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.sum
(tt::ttnn::SumOp)
Sum reduction op.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
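A sketch with assumed shapes: reducing over dim 1 while keeping the reduced dimension:

```mlir
// Illustrative only: keep_dim = true preserves rank, so 32x64 reduces to 32x1.
%result = "ttnn.sum"(%input) <{keep_dim = true, dim_arg = [1 : i32]}>
    : (tensor<32x64xbf16>) -> tensor<32x1xbf16>
```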
ttnn.to_device
(tt::ttnn::ToDeviceOp)
ToDevice op.
This op sends the input tensor to the given device with the given memory config.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.to_layout
(tt::ttnn::ToLayoutOp)
ToLayout op.
This op wraps all layout information gathered from ttir.toLayout. It is used and updated by the optimizer to perform optimizations, and it is later broken down into specific memory/layout operations (toDevice, toMemoryConfig, etc.). Currently, the TTNN backend uses this op solely for tilize/untilize, so all other attributes are marked optional. Once ttnn::to_layout supports the other attributes, the optional tag can be removed.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
layout | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout. Enum cases: row_major (`RowMajor`), tile (`Tile`), invalid (`Invalid`) |
dtype | ::mlir::tt::DataTypeAttr | TT DataTypes. Enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`) |
memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.to_memory_config
(tt::ttnn::ToMemoryConfigOp)
ToMemoryConfig op.
This op converts the memory config of the input tensor based on the given memory config. It handles:
- DRAM to L1
- L1 to DRAM
- Interleaved to sharded
- Sharded to interleaved
- Sharded to sharded (reshard)
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.transpose
(tt::ttnn::TransposeOp)
Transpose op.
Transpose tensor along two given dimensions.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim0 | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dim1 | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
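A sketch with an assumed shape, swapping the two dimensions of a 2-D tensor:

```mlir
// Illustrative only: dim0 and dim1 name the two axes to swap.
%result = "ttnn.transpose"(%input) <{dim0 = 0 : si32, dim1 = 1 : si32}>
    : (tensor<32x64xbf16>) -> tensor<64x32xbf16>
```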
ttnn.typecast
(tt::ttnn::TypecastOp)
Typecast op.
This op converts the data type of the input tensor to the given data type.
Interfaces: TTNN_OpModelInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dtype | ::mlir::tt::DataTypeAttr | TT DataTypes. Enum cases: f32 (`Float32`), f16 (`Float16`), bf16 (`BFloat16`), bfp_f8 (`BFP_Float8`), bfp_bf8 (`BFP_BFloat8`), bfp_f4 (`BFP_Float4`), bfp_bf4 (`BFP_BFloat4`), bfp_f2 (`BFP_Float2`), bfp_bf2 (`BFP_BFloat2`), u32 (`UInt32`), u16 (`UInt16`), u8 (`UInt8`) |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.where
(tt::ttnn::WhereOp)
Eltwise where.
Eltwise where operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTNN_OpModelInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |