Introduction
The following document provides an overview of the TT-MLIR project, with a focus on the technical specifications of an MLIR-based compiler stack. So what exactly is an MLIR-based compiler stack? MLIR (Multi-Level Intermediate Representation) is a subproject coming out of the LLVM Project. It seeks to introduce extensibility and sustainable code design to a very modular compiler framework. Essentially, this means taking a much larger, more involved compiler (like LLVM) and splitting it into sub-compilers that each produce their own Intermediate Representation (IR) of what you've fed the compiler.
Disclaimer: This is intended to be a working document; if you find something incorrect or incomplete, please feel free to create a PR.
Motivations
The idea of having a multi-level IR might not seem so far-fetched; in fact, it resembles some of our current software stacks. Going from a high-level TVM graph → lowered PyBUDA graph → netlist, with each layer having its own level of optimizations, is quite a familiar concept. However, there are problems with the reusability and integration of optimizations in the current software compiler stack. Currently, users are almost forced to choose between top-down or bottom-up optimization, with both requiring "expert-level" expertise to reach the desired performance. Developing two entirely different projects is taxing, and it's hard to translate the benefits of BUDA over to metal (or the other way around). One of the primary goals of tt-mlir is to enable a consistent programming model between software stacks: concepts for improving optimizations in the compiler stack should carry over 1:1 to hand-written TTNN.
The benefits grow even further when one considers all the possible entry points that multiple IRs present. Existing MLIR-based projects like OpenXLA and torch-mlir can natively output MLIR in a dialect that can be translated to the TTIR dialect as well!
What is MLIR and why use it?
MLIR is a compiler infrastructure that is designed to be modular and extensible. The main benefits the tt-mlir project hopes to gain by using MLIR include:
- Industry Standard Compiler Framework
- Lots of boilerplate algorithms, data structures, and useful software that is common to compiler development
- Ecosystem
- Hook into existing front-end MLIR projects
- Testing framework
- A battle-tested test infrastructure that will enable us to write fine grained tests and rely less on end-to-end testing
- Common IR Serialization Format that's easy to test, debug, and edit
Additional documentation to highlight the benefits of MLIR can be found here:
MLIR: Overview
MLIR is, at its root, an interpreter that can parse "readable" text in the .mlir format. The unique properties lie in the modularity of the parsing itself. MLIR is built upon a collection of Dialects, and each of these Dialects defines a collection of Operations, Types, and Attributes. These dialects follow their own syntax, and they can encode any amount of information. The benefit is that MLIR provides bindings and hooks such that a user can directly translate these IRs into usable artifacts for that layer of complexity. An example would be the relatively high-level TOSA Dialect, which is used to represent computation over tensors, and then lowering it to a more hardware-specific dialect that closely models the programming model of the hardware or underlying backend. It is the dialect system itself which powers the multi-level functionality of MLIR: with different dialects, a user can essentially "lower" through their software stack by just transforming between the dialects for their layers. Dialects range from purely mathematical dialects, to a LinAlg Dialect, to a TensorFlow Dialect defined for ML graphs. Each dialect encodes its own information, and its operations can use the Types/Attributes of other dialects as parameters. Multiple dialects can coexist in one module, and this is encouraged to highlight the optimizations of different dialects. In our use case for the TT stack, MLIR acts as a "mid-level" compiler which makes the task of joining together various entry points and backends much simpler.
MLIR Primitives
So what does MLIR look like, how does it work and get parsed? The hierarchy of an MLIR Module is as shown:
#any_device = #tt.operand_constraint<dram|l1|scalar|tile|any_device|any_device_tile>
module attributes {tt.system_desc = #tt.system_desc<[<#tt.arch<wormhole_b0>, #tt.grid<8x8>>], [0], [<pcie|host_mmio>], [<0, 0, 0, 0>]>} {
func.func @forward(%arg0: tensor<64x128xf32>, %arg1: tensor<64x128xf32>) -> tensor<64x128xf32> {
%0 = tensor.empty() : tensor<64x128xf32>
%1 = "ttir.multiply"(%arg0, %arg1, %0) <{operandSegmentSizes = array<i32: 2, 1>, operand_constraints = [#any_device, #any_device, #any_device]}> : (tensor<64x128xf32>, tensor<64x128xf32>, tensor<64x128xf32>) -> tensor<64x128xf32>
return %1 : tensor<64x128xf32>
}
}
- Attributes (defined using `#`)
  - The syntax for creating an attribute is modular, and custom assembly instructions for different attributes can be applied.
- Operations
  - Operations are accessed with the `.` method, so you'll see examples like `func.func` or `tensor.empty`. Each operation also provides its own assembly instructions but often strictly defines the type of result.
  - Quotes are added around `ttir.multiply` since it's part of a custom dialect, and more custom assembly instructions are applied to specify `operand_constraints`.
  - Operations typically have operands (arguments) and results, which are highlighted with `%`; these results and operands help to show the relationship between operations.
- Types
  - Types are shown as data formats throughout this compiled MLIR module, where `tensor` and `array` are some examples.
  - They help to demonstrate the transformation of information and its representation as it is processed across this module.
MLIR Workflow
The overall MLIR workflow doesn't involve writing .mlir files by hand, or necessarily even modifying them. The intermediate representations are truly just representations: we can parse them to show what the graph looks like at a given stage of optimization, or run a pass over them to optimize certain functions. The overall framework is designed with the following architecture in mind:
1. Graph information exists.
2. Graph information is transformed (through any method) into a high-level MLIR representation.
3. Passes are run on the high-level representation to lower into TTIR, a common IR that can be lowered into multiple backends.
4. Depending on the use case, more passes are run to lower to whatever backend the user would like (ex: TTNN backend).
What are Passes?
Transformations in MLIR are represented as passes, which can be executed when parsing or generating MLIR modules. These transformations can have a myriad of purposes, and how they modify the module is completely user defined. Some passes exist for lowering, as mentioned before: a dialect is parsed and each operation is transformed to a lowered dialect following a set of user-defined rules. In the context of this project, passes are also used for optimizations and backend code transformation. They're a powerful tool that provides most of the functionality to transform between layers of dialects, and a simple platform for modifying an MLIR module.
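As a loose illustration of the idea only (plain Python over a toy IR with hypothetical op names, not the actual MLIR pass APIs, which are C++/TableGen), a lowering pass can be thought of as a walk over a module's operations that rewrites each one according to user-defined rules:

```python
# Toy sketch of a lowering "pass": rewrite ops from one dialect into another.
# "tosa.add" / "ttir.add" here are illustrative names, not a real conversion.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str                                   # e.g. "tosa.add"
    operands: list = field(default_factory=list)

@dataclass
class Module:
    ops: list

def run_pass(module: Module, rules: dict) -> Module:
    """Rewrite each op via its rule; ops without a rule pass through."""
    module.ops = [rules.get(op.name, lambda o: o)(op) for op in module.ops]
    return module

# User-defined rule: lower a high-level add into the lower dialect's add.
rules = {"tosa.add": lambda op: Op("ttir.add", op.operands)}

m = Module([Op("tosa.add", ["%arg0", "%arg1"]), Op("func.return", ["%0"])])
run_pass(m, rules)
print([op.name for op in m.ops])  # ['ttir.add', 'func.return']
```

The real infrastructure adds much more (pass pipelines, verification, pattern rewriters), but the shape is the same: a module goes in, rules fire, and a module in a lower dialect comes out.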
Why not make our own?
Now that I've described the functionality of the MLIR framework, building an in-house multi-level intermediate representation system might seem pretty similar, so why go through the effort of adopting this framework?
One of the biggest reasons is the active developer community surrounding the project: being part of the LLVM Project means there is solid developer support, and the framework is designed to be a tool for many different paradigms of compute. This scalability and strong mission statement make MLIR a solid platform to use as the middle layer of our compiler stack. Furthermore, as a benefit of being part of a larger open source project, MLIR has a whole library of tests and infrastructure that we can leverage for solid code health while starting a new project.
Automation
It's not only about developer support; another key benefit of MLIR is that it's built with autogeneration in mind. Through TableGen, a lot of the boilerplate of creating this multi-level IR is abstracted away so development can truly focus on implementation and execution. This automation is built on top of a pre-existing robust framework with many implementations and support from other large players in the ML scene. By integrating with these automation pipelines, we give external developers a much simpler entry point into our software stack!
TT-MLIR: Bringing MLIR to the TT Stack
Now that we have defined this project, let's look at the implementation details of bringing MLIR (and related optimizations) into the TT stack. Since MLIR acts as a mid-level compiler, we can start by defining the "bottom" and "top" layers of the compiler. BUDA already has a well-defined set of frontend optimizations over a TVM-defined graph and is knowledgeable of the hardware that these models want to run on. We want to interrupt the BUDA stack so that it gives us only the frontend-compiled graph, before any hardware-specific lowering occurs. This produces information that is agnostic to the different backends and their execution on TT hardware, but is still valid to optimize at different levels for later compilation. The "bottom" of our stack is then defined as the backend that will produce the machine-specific code to be executed. While MLIR could allow for any level of complexity downwards, we will define a very aggressive TTNN backend for the MVP. Desired optimization list:
- Forge-FE (frontend)
  - Graph Optimizations, Constant Folding, Operation Fusion
- TT-MLIR (mid-level)
  - Data Storage, Memory Configuration, Grid Configuration
- TT-NN (backend)
  - Kernel Configuration*, Network Optimization

*Subject to Change / Be Moved to TT-MLIR
TT-MLIR Dialects
Now that we have defined the series of optimizations that we would like to see implemented in TT-MLIR, we can begin to define the dialects that support these different levels of optimization. For more detail on each of these dialects, please refer to the GitHub Wiki and TableGen descriptors. I think that Nick does a great job of documenting the key functionality.
TT Dialect
The TT Dialect contains only the common Types and Attributes used throughout the many levels of the mid-level compiler.
TTIR Dialect
The TTIR Dialect is the common dialect for TT-MLIR; as such, it doesn't define anything hardware- or backend-specific. It lists out general actions that would take place on TT hardware, such as dispatch, layout, and kernel operations.
Generic Operation
This is one of two operations crucial to understanding the intended optimization characteristics of the TTIR dialect. The generic operation dictates the actions taken to dispatch some instruction to TT hardware for execution. Parametrically, the operation consumes inputs, outputs, maps for reading the tensors, and access types for the memory. These parameters highlight the optimizations that can be performed at this level: changing the location of the memory, transposing using variant access maps, or even changing the grid upon which the computation takes place. The operation also contains a block in which the exact behaviour of that operation is stored.
Layout Operation
The layout operation is key in describing the storage of memory throughout the execution graph. Layout determines the sharding spec, location of the memory, data types, and tile sizes of some tensor. While generic describes the dispatch for some data-wise transformation to take place, the data itself is laid out across the chip through the layout operation.
Both of these operations describe the key functionality of the TTIR dialect and the optimization space that it provides.
Built-in MLIR Dialects
The functionality of the TT-MLIR dialects also depends on / is inspired by built-in MLIR dialects like Affine and LinAlg. Below are summaries of some of the key members of these dialects.
Affine Dialect
[Reference] Affine maps describe transformations on coordinate systems. If that doesn't quite land, imagine trying to index a rank-2 tensor: with t[x, y] I can access the element in the Xth row and Yth column, but to transpose the tensor I might have to re-layout the entire tensor so that t[x, y] now fetches the element in the Yth row and Xth column. This transpose can instead be represented using an affine map that transforms (x, y) -> (y, x), letting the tensor data remain in place while the access method is modified. This extends to more complex transformations, so that stride lengths or unique indexing methods can be implemented without complicated data manipulation.
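The transpose example above can be sketched in a few lines of Python (an illustration of the concept only, not the Affine dialect's actual machinery): the data stays in place and only the index map changes.

```python
# Rank-2 "tensor" stored row-major; we never move its data.
data = [[1, 2, 3],
        [4, 5, 6]]

identity  = lambda x, y: (x, y)   # affine map (x, y) -> (x, y)
transpose = lambda x, y: (y, x)   # affine map (x, y) -> (y, x)

def access(tensor, index_map, x, y):
    # Apply the map to the logical index before touching storage.
    i, j = index_map(x, y)
    return tensor[i][j]

print(access(data, identity, 0, 2))   # row 0, col 2 of the original -> 3
print(access(data, transpose, 2, 0))  # same element via the transposed view -> 3
```

Strided or tiled access patterns work the same way: compose a different map in front of the storage lookup instead of rewriting the storage.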
Tensor Dialect
[Reference] The tensor dialect defines the functionality and Type of the fundamental tensor. This dialect contains members that represent the manipulation and representation of tensors as multi-dimensional data with shapes and datatypes. The reference covers the key topics if implementation details are needed.
Func Dialect
TOSA Dialect
SCF Dialect
EmitC Dialect
TT-Explorer - Performance Optimization Tool
A unique project related to TT-MLIR is the integration of performance-optimization tooling, so that users can easily visualize and tune their models without needing an expert-level understanding of the tech stack. TT-Explorer is built with Google AI's Model Explorer as the base visualization tool, plus a custom adapter to parse TT-MLIR projects. This lets users readily tune their models and optimize at the TTIR layer (ex: they can change certain memory to be laid out in L1 instead of DRAM, or change the grid layout of an operation to be larger than what was previously assigned). After compilation with these overrides, the runtime information can be fed directly into a Tracy performance analysis so the user can visualize the impact of their tuning, see which operations were least performant, and continue in a gamified design loop of iterative performance tuning!
Building
These are the steps required to get the TT-MLIR project running on your machine.
Please refer to the Dependencies section before building the project.
Environment setup
You only need to build this once; it builds LLVM, flatbuffers, and a Python virtual environment.
cmake -B env/build env
cmake --build env/build
- It is recommended to use the system installation of python3 for the virtual environment. Please ensure that you do not already have a venv activated before running the above command.
- Please ensure that the directory `/opt/ttmlir-toolchain` exists and that its owner is the current user, i.e. the one that executes the above `cmake` commands. The commands to create it and assign the proper ownership are:
  sudo mkdir -p /opt/ttmlir-toolchain
  sudo chown -R $USER /opt/ttmlir-toolchain
Build
source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17
cmake --build build
- To enable the ttnn/metal runtime, add `-DTTMLIR_ENABLE_RUNTIME=ON`. Clang 17 is the minimum required version when enabling the runtime.
- To enable the ttnn/metal perf runtime, add `-DTT_RUNTIME_ENABLE_PERF_TRACE=ON`.
- To accelerate the builds with ccache, use `-DCMAKE_CXX_COMPILER_LAUNCHER=ccache`.
- To accelerate builds further, if python bindings aren't needed, use `-DTTMLIR_ENABLE_BINDINGS_PYTHON=OFF`. For some reason the python bindings link step is very slow.
- The TTNN build is automatically integrated / handled by the tt-mlir cmake build system. For debugging and further information regarding the TTNN backend build step, please refer to the TTNN Documentation.
- The runtime build step depends on the `ARCH_NAME` environment variable, which is set in the `env/activate` script. If you want to build the runtime for a different architecture, please set `ARCH_NAME` to the desired value (one of `grayskull`, `wormhole_b0`, or `blackhole`). Please note that the runtime is built only if `TTMLIR_ENABLE_RUNTIME=ON`.
- In addition to `ARCH_NAME`, the runtime build depends on the `TT_METAL_HOME` variable, which is also set in the `env/activate` script. For more information, please refer to the TT-NN and TT-Metalium installation documentation.
OS | Offline Compiler Only | Runtime Enabled Build | Runtime + Perf Enabled Build |
---|---|---|---|
Ubuntu 22.04 | ✅ | ✅ | ✅ |
Ubuntu 20.04 | ✅ | ❌ | ❌ |
MacOS | ✅ | ❌ | ❌ |
Test
source env/activate
cmake --build build -- check-ttmlir
llvm-lit
Under the hood the check-ttmlir cmake target is running `llvm-lit`. With it you can:
# Query which tests are available
llvm-lit -sv ./build/test --show-tests
# Run an individual test:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR/test_allocate.mlir
# Run a sub-suite:
llvm-lit -sv ./build/test/ttmlir/Dialect/TTIR
See the full llvm-lit documentation for more information.
Lint
source env/activate
cmake --build build -- clang-tidy
Pre-Commit
Pre-Commit applies a git hook to the local repository such that linting is checked and applied on every git commit
action. Install from the root of the repository using:
source env/activate
pre-commit install
If you have already committed before installing the pre-commit hooks, you can run on all files to "catch up":
pre-commit run --all-files
For more information visit pre-commit
Docs
source env/activate
cmake --build build -- docs
mdbook serve build/docs
`mdbook` can be installed with the system's package manager. `mdbook serve` will by default create a local server at `http://localhost:3000`.
Note: If you want to build the docs on MacOS, there are two extra dependencies: doxygen and graphviz. Both can be installed using Homebrew by running the following commands:
brew install doxygen
brew install graphviz
Dependencies
Ubuntu Common
Make sure to have Git LFS installed. You can install it with the following command:
sudo apt-get install git-lfs
Ubuntu 22.04
On Ubuntu 22.04 we need to install clang and ninja, and to update cmake, since 3.20 is the minimum version required for this project.
sudo apt update
sudo apt upgrade
sudo apt install clang-17
sudo apt install ninja-build
sudo apt remove cmake -y
pip3 install cmake --upgrade
hash -r
Ensure cmake can be found in the path pip installed it to, e.g.:
PATH=$PATH:$HOME/.local/bin
Then run the following command to check the cmake version, which should be newer than 3.20:
cmake --version
MacOS
On MacOS we need to install the latest versions of cmake and ninja, which can be done using Homebrew (docs for installing Homebrew: https://brew.sh).
brew install cmake
brew install ninja
Common Build Errors
TTMLIRPythonCAPI target requires changing an RPATH
CMake Error at /opt/ttmlir-toolchain/lib/cmake/llvm/AddLLVM.cmake:594 (add_library):
The install of the TTMLIRPythonCAPI target requires changing an RPATH from
the build tree, but this is not supported with the Ninja generator unless
on an ELF-based or XCOFF-based platform. The
CMAKE_BUILD_WITH_INSTALL_RPATH variable may be set to avoid this relinking
step.
If you get the above error, it means you tried to build with an old version of cmake or ninja and there is a stale file. To fix this, `rm -rf` your build directory, install a newer version of cmake/ninja, and then rebuild. If you installed ninja via `sudo apt install ninja-build`, it might still not be up to date (v1.10.0). You may use the ninja in the python virtual environment, or install it via `pip3 install -U ninja`; either way, version `1.11.1.git.kitware.jobserver-1` should work.
clang++ is not a full path and was not found in the PATH
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_CXX_COMPILER:
clang++
is not a full path and was not found in the PATH.
Tell CMake where to find the compiler by setting either the environment
variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.
CMake Error at CMakeLists.txt:2 (project):
The CMAKE_C_COMPILER:
clang
is not a full path and was not found in the PATH.
Tell CMake where to find the compiler by setting either the environment
variable "CC" or the CMake cache entry CMAKE_C_COMPILER to the full path to
the compiler, or to the compiler name if it is in the PATH.
If you get the above error, it means you need to install clang, which you can do with `sudo apt install clang` on Ubuntu.
`sfpi`, `trisc`, `ncrisc` build failure
tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 1: version: not found
tt-forge-fe/third_party/tt-mlir/third_party/tt-metal/src/tt-metal/tt_metal/third_party/sfpi/compiler/bin/riscv32-unknown-elf-g++: 2: oid: not found
size: '1961632': No such file
size: '1961632': No such file
size: '1961632': No such file
Always | FATAL | ncrisc build failed
If you got the above error, it means that the SFPI or a similar component build failed. First, make sure you have Git LFS set up (e.g. `sudo apt-get install git-lfs`). Then, try to pull the SFPI submodule manually:
cd third_party/tt-metal/src/tt-metal
git submodule foreach 'git lfs fetch --all && git lfs pull'
Then, try to build again.
Common Runtime Errors
Debugging python on macOS
When debugging python on macOS via lldb you may see an error like:
(lldb) r
error: process exited with status -1 (attach failed (Not allowed to attach to process. Look in the console messages (Console.app), near the debugserver entries, when the attach failed. The subsystem that denied t
he attach permission will likely have logged an informative message about why it was denied.))
For preinstalled macOS binaries you must manually codesign with debug entitlements.
Create a file `debuggee-entitlement.xml`:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>com.apple.security.cs.disable-library-validation</key>
<true/>
<key>com.apple.security.get-task-allow</key>
<true/>
</dict>
</plist>
Sign the binary:
sudo codesign -f -s - --entitlements debuggee-entitlement.xml /opt/ttmlir-toolchain/venv/bin/python
Internal Build Notes / IRD
- When building the runtime we must use the Ubuntu 22.04 docker image.
- When making an IRD reservation use `--docker-image yyz-gitlab.local.tenstorrent.com:5005/tenstorrent/infra/ird-ubuntu-22-04-amd64:latest`.
- You'll have to manually install a newer version of cmake, at least 3.22; the easiest way to do this is to `pip install cmake` and make sure this one is in your path.
- You'll want the LLVM installation to persist across IRD reservations; you can achieve this by:
  - `mkdir /localdev/$USER/ttmlir-toolchain`
  - When requesting an IRD, use `--volumes /localdev/$USER/ttmlir-toolchain:/opt/ttmlir-toolchain`
Working with Docker Images
Components:
- Dockerfile
- Workflow for building Docker image
- Project build using Docker image
Overview
We use docker images to prepare the project environment and install dependencies, tooling, and the prebuilt toolchain. The project builds four docker images:
- Base image `tt-mlir-base-ubuntu-22-04` (`Dockerfile.base`)
- CI image `tt-mlir-ci-ubuntu-22-04` (`Dockerfile.ci`)
- Base IRD image `tt-mlir-base-ird-ubuntu-22-04` (`Dockerfile.ird`)
- IRD image `tt-mlir-ird-ubuntu-22-04` (`Dockerfile.ird`)
The base image starts from a supported base image (Ubuntu 22.04) and installs the dependencies for the project build. From there we build the CI image, which contains the prebuilt toolchain and is used in CI to shorten the build time. The IRD images contain dev tools like GDB, vim, etc., as well as ssh, and are used in IRD environments.
During the CI Docker build, the project is built and tests are run to ensure that everything is set up correctly. If any dependencies are missing, the Docker build will fail.
Building the Docker Image using GitHub Actions
The GitHub Actions workflow Build and Publish Docker Image builds the Docker images and uploads them to GitHub Packages at https://github.com/orgs/tenstorrent/packages?repo_name=tt-mlir. We use the git SHA we build from as the tag.
Building the Docker Image Locally
To test the changes and build the image locally, use the following command:
docker build -f .github/Dockerfile.base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ci-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=base -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-base-ubuntu-22-04:latest .
docker build -f .github/Dockerfile.ird --build-arg FROM_IMAGE=ci -t ghcr.io/tenstorrent/tt-mlir/tt-mlir-ird-ubuntu-22-04:latest .
Using the Image in GitHub Actions Jobs
The GitHub Actions workflow Build in Docker uses a Docker container for building:
container:
image: ghcr.io/${{ github.repository }}/tt-mlir-ci-ubuntu-22-04:latest
options: --user root
Tools
Currently, there are a few primary tools that are part of the `ttmlir` project:
- `ttmlir-opt`: The `ttmlir` optimizer driver. This tool is used to run the `ttmlir` compiler passes on a `.mlir` source file and is central to developing and testing the compiler.
- `ttrt`: This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files.
- `tt-explorer`: Visualizer tool for `ttmlir`-powered compiler results. Visualizes emitted `.mlir` files to display the compiled model, attributes, and performance results, and provides a platform for human-driven overrides to gamify model tuning.
ttmlir-opt
The `ttmlir` optimizer driver. This tool is used to run the `ttmlir` compiler passes on a `.mlir` source file and is central to developing and testing the compiler.
Simple Test
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir
# Or
./build/bin/ttmlir-opt --ttir-to-ttmetal-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir
ttmlir-translate
The `ttmlir-translate` translation utility. Unlike the `ttmlir-opt` tool, which is used to run passes within the MLIR world, `ttmlir-translate` allows us to ingest something (e.g. code) into the MLIR world, and also produce something (e.g. an executable binary, or even code again) from MLIR.
Generate C++ code from MLIR
# First, let's run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o c.mlir
# Now run `ttmlir-translate` to produce C++ code
./build/bin/ttmlir-translate -mlir-to-cpp c.mlir -allow-unregistered-dialect
Bonus: these two commands can be piped, to avoid writing an `.mlir` file to disk, like so:
./build/bin/ttmlir-opt --ttir-load-system-desc --ttir-layout --convert-ttir-to-ttnn --convert-ttnn-to-emitc test/ttmlir/Dialect/TTNN/simple_multiply.mlir | ./build/bin/ttmlir-translate -mlir-to-cpp -allow-unregistered-dialect
Generate flatbuffer file from MLIR
# First run `ttmlir-opt` to convert to proper dialect
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_multiply.mlir -o ttnn.mlir
# Now run `ttmlir-translate` to produce flatbuffer file
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
ttrt
This tool is intended to be a swiss army knife for working with flatbuffers generated by the compiler. Its primary role is to inspect and run flatbuffer files. It enables the running of flatbuffer files without a front-end runtime.
Building
source env/activate
cmake --build build -- ttrt
ttrt --help
Installing ttrt as python whls (coming soon)
- Download whls
- Create a python venv
python -m venv ttrt_env
source ttrt_env/bin/activate
- Install whls
pip install *.whl
Building runtime mode
Add the following flag when building the compiler:
-DTTMLIR_ENABLE_RUNTIME=ON
If you are building with runtime mode on (`-DTTMLIR_ENABLE_RUNTIME=ON`), you will have to install the following package when using `ttrt`:
pip install torch
Building perf mode
Add the following flags when building the compiler
-DTTMLIR_ENABLE_RUNTIME=ON
-DTT_RUNTIME_ENABLE_PERF_TRACE=ON
If you are building with perf mode on (`-DTT_RUNTIME_ENABLE_PERF_TRACE=ON`), you will have to install the following packages when using `ttrt`:
pip install torch
pip install loguru
pip install pandas
pip install seaborn
pip install graphviz
pip install pyyaml
pip install click
Generate a flatbuffer file from compiler
The compiler supports a pass to load a system descriptor to compile against. You can feed this pass into ttmlir-opt.
- Build ttmlir
- Build ttrt (see building section on this page)
- Generate a ttsys file from the system you want to compile for using ttrt. This will create a `system_desc.ttsys` file under the `ttrt-artifacts` folder.
ttrt query --save-artifacts
- Use ttmlir-opt tool in compiler to feed system descriptor. See the ttmlir-opt documentation for more information on how to generate .mlir files.
./build/bin/ttmlir-opt --ttir-load-system-desc="path=/path/to/system_desc.ttsys" --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
or (pass the system descriptor path directly into ttir-to-ttnn-backend-pipeline):
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline="system-desc-path=/path/to/system_desc.ttsys" test/ttmlir/Dialect/TTNN/simple_subtract.mlir -o ttnn.mlir
- Use ttmlir-translate tool in compiler to generate the flatbuffer executable. See the ttmlir-translate documentation for more information on how to generate flatbuffer files.
./build/bin/ttmlir-translate --ttnn-to-flatbuffer ttnn.mlir -o out.ttnn
- Run your test cases using ttrt
ttrt run /path/to/out.ttnn
Generate flatbuffer files using llvm-lit
There are already existing .mlir test cases under `test/ttmlir/Silicon`. You can use the llvm-lit tool to generate the corresponding ttnn and ttm files.
- Build ttmlir
- Build ttrt (see building section on this page)
- Generate a ttsys file from the system you want to compile for using ttrt. This will create a `system_desc.ttsys` file under the `ttrt-artifacts` folder.
ttrt query --save-artifacts
- Export this file in your environment using `export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys`. When llvm-lit is run, it will query this variable and generate the ttnn and ttm files using this system. Optionally, you can also provide this manually when running llvm-lit.
- Generate your test cases. This will generate all your ttnn and ttm files under `build/test/ttmlir/Silicon`. ttnn files have a `.ttnn` file extension and ttmetal files have a `.ttm` extension.
cmake --build build -- check-ttmlir
- (Optional) If you have a single .mlir file (or a directory of custom .mlir files) that you created using the compiler, and you want to generate the corresponding ttnn and ttm files for it, you can run llvm-lit standalone on the path of your .mlir file or directory to generate the flatbuffer executables. You will have to make sure you add the correct llvm-lit configs into your .mlir file; see the section on adding llvm-lit config options inside a .mlir file for more info. You must also make sure your .mlir test is found within the test/ttmlir/Silicon folder (and point lit to the build folder)!
llvm-lit -v ./build/test/ttmlir/Silicon
or
SYSTEM_DESC_PATH=/path/to/system_desc.ttsys llvm-lit -v ./build/test/ttmlir/Silicon
- Run your test cases using ttrt
ttrt run /path/to/test.ttnn
ttrt run /path/to/dir/of/flatbuffers
Adding llvm-lit config options inside a .mlir file to create flatbuffer binaries
Inside your .mlir file, you can add config options that llvm-lit will use when running against that test case. For the purpose of generating flatbuffer executables, you can add --ttir-load-system-desc="path=%system_desc_path%", which tells llvm-lit to use the system desc found from the environment variable set by export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys. You can also provide a custom path to a system desc file.
// RUN: ttmlir-opt --ttir-load-system-desc="path=%system_desc_path%" --ttir-implicit-device --ttir-layout --convert-ttir-to-ttnn %s > %t.mlir
// RUN: FileCheck %s --input-file=%t.mlir
// RUN: ttmlir-translate --ttnn-to-flatbuffer %t.mlir > %t.ttnn
Adding new mlir test cases
You can copy your .mlir test file (with the appropriate llvm-lit config options for generating flatbuffer binaries) into test/ttmlir/Silicon
. Then, follow generating flatbuffer files using llvm-lit to generate the executables to run!
Versioning
ttrt and flatbuffers have a strict versioning check. When running a flatbuffer against ttrt, you have to make sure the flatbuffer was generated using the same version as ttrt (or vice versa). Major and minor versions are manually set using GitHub tags when releases are made. The patch version is the number of commits since the last major/minor tag.
vmajor.minor.patch
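As an illustrative sketch, the vmajor.minor.patch scheme above can be parsed and compared like this. This is not ttrt's actual implementation; in particular, whether all three components must match exactly is an assumption based on the check being described as strict.

```python
# Illustrative sketch of the version scheme described above; this is NOT
# ttrt's actual check. Tags look like "v<major>.<minor>.<patch>".
def parse_version(tag):
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)

def compatible(ttrt_tag, flatbuffer_tag):
    # The docs describe the check as strict, so we assume an exact match
    # on all three components is required.
    return parse_version(ttrt_tag) == parse_version(flatbuffer_tag)
```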
APIs
ttrt --help
ttrt read
ttrt run
ttrt query
ttrt perf
ttrt check
Command Line
There are different ways you can use the APIs under ttrt. The first is via the command line as follows. All artifacts are saved under the ttrt-artifacts folder under the TT_MLIR_HOME environment variable. By default, all logging is printed to the terminal. You can specify a log file to dump output to.
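As a rough sketch of where artifacts land, a helper like the following mirrors the description above (the path layout is assumed from the docs; ttrt's real resolution logic may differ):

```python
import os

# Hypothetical helper mirroring the description above: artifacts are saved
# under a "ttrt-artifacts" folder beneath TT_MLIR_HOME. Illustrative only.
def artifacts_dir(env):
    home = env.get("TT_MLIR_HOME")
    if home is None:
        raise RuntimeError("TT_MLIR_HOME is not set")
    return os.path.join(home, "ttrt-artifacts")
```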
read
Read sections of a binary file
ttrt read --help
ttrt read --section mlir out.ttnn
ttrt read --section cpp out.ttnn
ttrt read --section version out.ttnn
ttrt read --section system_desc out.ttnn
ttrt read --section inputs out.ttnn
ttrt read --section outputs out.ttnn
ttrt read --section all out.ttnn
ttrt read --section all out.ttnn --clean-artifacts
ttrt read --section all out.ttnn --save-artifacts
ttrt read --section all /dir/of/flatbuffers
ttrt read system_desc.ttsys
ttrt read --section system_desc system_desc.ttsys
ttrt read system_desc.ttsys --log-file ttrt.log
run
Run a binary file or a directory of binary files
Note: It's required to be on a system with silicon and to have a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).
ttrt run --help
ttrt run out.ttnn
ttrt run out.ttnn --seed 0
ttrt run out.ttnn --init arange
ttrt run out.ttnn --identity
ttrt run out.ttnn --identity --rtol 1 --atol 1
ttrt run out.ttnn --clean-artifacts
ttrt run out.ttnn --save-artifacts
ttrt run out.ttnn --loops 10
ttrt run --program-index all out.ttnn
ttrt run --program-index 0 out.ttnn
ttrt run /dir/of/flatbuffers
ttrt run /dir/of/flatbuffers --loops 10
ttrt run /dir/of/flatbuffers --log-file ttrt.log
query
Query the system to obtain the system desc file (optionally store it to disk)
Note: It's required to be on a system with silicon and to have a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).
ttrt query --help
ttrt query --save-artifacts
ttrt query --clean-artifacts
ttrt query --save-artifacts --log-file ttrt.log
perf
Run performance mode of a binary file or a directory of binary files
Note: It's required to be on a system with silicon and to have a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON). You also need a perf-enabled build (-DTT_RUNTIME_ENABLE_PERF_TRACE=ON).
Note: You can collect host-only performance data via the --host-only flag. By default, both host and device side performance data are collected.
Restriction: /dir/of/flatbuffers can only be used when collecting --host-only data (performance data is collected upon closing of the device; if we run a directory of flatbuffers, we cannot get accurate device performance data since the device is only closed at the end of execution).
Restriction: For now, perf mode can only run on .mlir files that have exactly one function (func.func).
ttrt perf --help
ttrt perf out.ttnn
ttrt perf out.ttnn --clean-artifacts
ttrt perf out.ttnn --save-artifacts
ttrt perf out.ttnn --loops 10
ttrt perf --program-index all out.ttnn
ttrt perf --program-index 0 out.ttnn
ttrt perf --host-only out.ttnn
ttrt perf /dir/of/flatbuffers --host-only
ttrt perf /dir/of/flatbuffers --loops 10 --host-only
ttrt perf /dir/of/flatbuffers --log-file ttrt.log --host-only
check
Check a binary file or a directory of binary files against a system desc (by default, uses the host machine)
Note: It's required to be on a system with silicon and to have a runtime-enabled build (-DTTMLIR_ENABLE_RUNTIME=ON).
ttrt check --help
ttrt check out.ttnn
ttrt check out.ttnn --system-desc /path/to/system_desc.ttsys
ttrt check out.ttnn --clean-artifacts
ttrt check out.ttnn --save-artifacts
ttrt check out.ttnn --log-file ttrt.log
ttrt check /dir/of/flatbuffers --system-desc /dir/of/system_desc
ttrt as a python package
The other way to use the APIs under ttrt is to import it as a library. This allows you to use it in custom scripts.
Import ttrt as a python package
from ttrt.common.api import API
Setup API and register all features
API.initialize_apis()
Setup arguments
You can specify certain arguments to pass to each API, or use the default arguments provided
args
This can be a dictionary of values to set inside your API instance. These are the same options as found via the command line. You can get the full list of supported arguments via each API's registered_args dictionary (e.g. API.Query.registered_args). Any argument not provided will be set to its default.
custom_args = API.Query.registered_args
custom_args["clean-artifacts"] = True
query_instance = API.Query(args=custom_args)
custom_args = { "clean-artifacts": True }
query_instance = API.Query(args=custom_args)
logging
You can specify a logging module to set inside your API instance. The rationale is to support different instances of different APIs, each of which can log to a different file.
from ttrt.common.util import Logger
log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)
read_instance = API.Read(logging=custom_logger)
artifacts
You can specify an artifacts directory to store all the generated metadata during the execution of any API run. This lets you use different artifact directories for different API instances if you wish.
from ttrt.common.util import Artifacts
log_file_name = "some_file_name.log"
artifacts_folder_path = "/opt/folder"
custom_logger = Logger(log_file_name)
custom_artifacts = Artifacts(logging=custom_logger, artifacts_folder_path=artifacts_folder_path)
run_instance = API.Run(artifacts=custom_artifacts)
Execute API
Once all the arguments are set up, you can run your API instance with your provided arguments. Note that APIs are stateless: subsequent calls to the same API instance will not preserve previous call artifacts. For example, you can generate a new artifacts directory for subsequent runs if you wish to call the APIs multiple times.
query_instance()
read_instance()
run_instance()
Putting it all together
You can do interesting things by combining all of the above features in your Python script:
from ttrt.common.api import API
from ttrt.common.util import Logger
from ttrt.common.util import Artifacts
API.initialize_apis()
custom_args = API.Run.registered_args
custom_args["clean-artifacts"] = True
custom_args["save-artifacts"] = True
custom_args["loops"] = 10
custom_args["init"] = "randn"
custom_args["binary"] = "/path/to/subtract.ttnn"
log_file_name = "some_file_name.log"
custom_logger = Logger(log_file_name)
artifacts_folder_path = "/opt/folder"
custom_artifacts = Artifacts(logging=custom_logger, artifacts_folder_path=artifacts_folder_path)
run_instance = API.Run(args=custom_args, logging=custom_logger, artifacts=custom_artifacts)
FAQ
Flatbuffer version does not match ttrt version!
- ttrt and flatbuffers have strict versioning that is checked during ttrt execution. You will have to generate a flatbuffer using the same version of ttrt (or vice versa). This means you might have to build on the same branch on which the flatbuffer was generated, or regenerate the flatbuffer using your current build.
System desc does not match flatbuffer!
- flatbuffers are compiled using a specific system desc (or default values if no system desc is provided). During runtime, the flatbuffer's system desc is checked against the current system to ensure the system being run on supports the flatbuffer that was compiled. If you get this error, you will have to regenerate the flatbuffer using the system you want to run on. See the section on generating a flatbuffer file from the compiler for how to do this.
I just want to test and push my commit! What do I do?
- Follow these steps (on both n150 and n300):
1. Build ttmlir (sample instructions - subject to change)
source env/activate
cmake -G Ninja -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang-17 -DCMAKE_CXX_COMPILER=clang++-17 -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DTTMLIR_ENABLE_RUNTIME=ON -DTT_RUNTIME_ENABLE_PERF_TRACE=ON
cmake --build build
2. Build ttrt (sample instructions - subject to change)
cmake --build build -- ttrt
3. Query system
ttrt query --save-artifacts
4. Export system desc file
export SYSTEM_DESC_PATH=/path/to/system_desc.ttsys (path dumped in previous command)
5. Generate test cases
cmake --build build -- check-ttmlir
6. Run test cases
ttrt run build/test/ttmlir/Silicon
7. (Optional) Run perf test cases
ttrt perf build/test/ttmlir/Silicon
tt-explorer
Visualizer tool for ttmlir-powered compiler results. It visualizes emitted .mlir files to display the compiled model, attributes, and performance results, and provides a platform for human-driven overrides to gamify model tuning.
For more information and installation details please check out the repository README at https://github.com/vprajapati-tt/tt-explorer?tab=readme-ov-file#tt-explorer
Flatbuffers
Flatbuffers are the binary serialization format used by TTMLIR and currently come in a few flavors (designated by the file extension):
- .ttsys: A system description file that is the mechanism for supplying target information to the compiler. These can be collected on a target machine and downloaded to a development machine to enable cross-compilation.
- .ttnn: A compiled binary file intended to be loaded and executed by the TTNN backend runtime.
- .ttb: A compiled binary file intended to be loaded and executed by the TTMetal backend runtime (unsupported).
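To summarize the flavors, a small lookup keyed by file extension (the extension strings come from the list above; the dispatch helper itself is illustrative and not part of TTMLIR):

```python
import os

# The three flatbuffer flavors described above, keyed by file extension.
FLAVORS = {
    ".ttsys": "system description (target info for the compiler)",
    ".ttnn": "TTNN backend runtime binary",
    ".ttb": "TTMetal backend runtime binary (unsupported)",
}

def flavor_of(path):
    # Classify a file by its extension; unknown extensions fall through.
    ext = os.path.splitext(path)[1]
    return FLAVORS.get(ext, "unknown")
```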
ci
Our CI infrastructure is currently hosted in the cloud. Cloud machines are used and linked as GitHub runners.
Key Words
Target Silicon (coming soon)
- 1:1 mapping to unique system-desc (this is because an n150 card can have different harvested rows)
Target Family
- product type (n150, n300)
Target Capabilities (coming soon)
- describes testable traits of Target Family
n150: {
test params to use if running on n150
}
n300: {
test params to use if running on n300
}
Test Capabilities (coming soon)
- set of target capabilities defined in the test
- test will populate certain parameters depending on the Target Family/Target Silicon it is running on
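The inheritance described above (a test picks up parameters from the Target Family/Target Silicon it runs on, then overrides some of them) might be sketched like this. The structure and the num_chips key are hypothetical, following the n150/n300 example blocks:

```python
# Hypothetical sketch of Target Capabilities: each target family maps to
# the test params a test would inherit when running on that family.
TARGET_CAPABILITIES = {
    "n150": {"num_chips": 1},
    "n300": {"num_chips": 2},
}

def params_for(target_family, overrides=None):
    # A test inherits the family's params and may override some of them.
    params = dict(TARGET_CAPABILITIES[target_family])
    params.update(overrides or {})
    return params
```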
GitHub Runner CI Tags
Runner Use
There are 2 types of runner machines: builders build offline, while runners are silicon machines.
- builder
- runner
Runner Type
There are 2 runner types: bare-metal runners are standalone machines, while virtual machines are Kubernetes pods.
- bare-metal
- virtual-machine
Architecture
Supported architectures
- wormhole_b0
- blackhole (coming soon)
Pipeline Type
Supported pipelines
- perf
- functional
Active
Defines whether a runner is in service or taken out of service for maintenance
- in-service
- out-of-service
Target Family
Supported configurations of machines
- n150
- n300
- t3000 (coming soon)
- tg (coming soon)
- tgg (coming soon)
Target Silicon (coming soon)
- silicon-n150-0 (0th row harvested)
- silicon-n150-1 (1st row harvested)
- silicon-n300-0-0 (0th row harvested on both chips)
Pipeline durations
- push: every push to main
- pr: every PR
CI Test Flow
1. GitHub runner
- build tt-mlir
- build ttrt
- upload artifacts
2. Silicon runner
- download tt-mlir / ttrt artifacts
- ttrt generate system desc
- llvm-lit runs all unit tests, including silicon ones, to generate flatbuffers (it will only generate ones that are supported for that test file)
- ttrt runs generated flatbuffers
Adding a test
When adding a test, you can specify when the test should run and what values it should inherit. The test defines how it should run, not the infrastructure; the infrastructure executes what the test defines. For now, if you specify nothing, it will run with all default parameters. Note: if you provide a target family, it will by default run on any target silicon machine. If you need a specific target silicon machine (e.g. one with the 1st row harvested), specify it in Target Silicon. Note: if you specify the perf pipeline, it will automatically run on a bare metal machine.
Default parameters:
[Architecture]: [wormhole_b0]
[Pipeline]: [functional, perf]
[Target Family]: [n150, n300]
[Target Silicon]: []
[Duration]: [push]
Location: test/ttmlir/Silicon
File Type: .mlir
REQUIRES: [Architecture] [Pipeline] [Target Family] [Target Silicon] [Duration] (coming soon)
UNSUPPORTED: [Target Family] [Target Silicon] (coming soon)
Additional Reading
This section contains pointers to reading material that may be useful for understanding the project.
MLIR
- https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/index.html
- https://mlir.llvm.org/docs/Tutorials/Toy/
- https://www.jeremykun.com/2023/08/10/mlir-getting-started/
- https://arxiv.org/pdf/2002.11054
- https://ieeexplore.ieee.org/abstract/document/9370308
Dialects
- affine dialect
- Affine map is a really powerful primitive that can be used to describe most data movement patterns.
- It can also be used to describe memory layouts.
- linalg dialect
- tosa dialect
- tosa spec
- memref dialect
- torch-mlir
- onnx-mlir
- triton-mlir
Tablegen
LLVM Testing Framework Tools
Jax
Flatbuffer
Openxla Website
openxla
StableHLO
Contributor Covenant Code of Conduct
Our Pledge
We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
Our Standards
Examples of behavior that contributes to a positive environment for our community include:
- Demonstrating empathy and kindness toward other people
- Being respectful of differing opinions, viewpoints, and experiences
- Giving and gracefully accepting constructive feedback
- Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
- Focusing on what is best not just for us as individuals, but for the overall community
Examples of unacceptable behavior include:
- The use of sexualized language or imagery, and sexual attention or advances of any kind
- Trolling, insulting or derogatory comments, and personal or political attacks
- Public or private harassment
- Publishing others' private information, such as a physical or email address, without their explicit permission
- Other conduct which could reasonably be considered inappropriate in a professional setting
Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.
Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.
Scope
This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official e-mail address, posting via an official social media account, or acting as an appointed representative at an online or offline event.
Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at nsmith@tenstorrent.com or staylor@tenstorrent.com. All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the reporter of any incident.
Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:
1. Correction
Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.
Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.
2. Warning
Community Impact: A violation through a single incident or series of actions.
Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.
3. Temporary Ban
Community Impact: A serious violation of community standards, including sustained inappropriate behavior.
Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.
4. Permanent Ban
Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.
Consequence: A permanent ban from any sort of public interaction within the community.
Attribution
This Code of Conduct is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.
Community Impact Guidelines were inspired by Mozilla's code of conduct enforcement ladder.
For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.
Project Structure
- env: Contains the environment setup for building project dependencies, such as LLVM and Flatbuffers
- include/ttmlir: Public headers for the TTMLIR library
  - Dialect: MLIR dialect interfaces and definitions. Dialects typically follow a common directory tree structure:
    - IR: MLIR operation/type/attribute interfaces and definitions
    - Passes.[h|td]: MLIR pass interfaces and definitions
    - Transforms: Common MLIR transformations, typically invoked by passes
  - Target: Flatbuffer schema definitions. This defines the binary interface between the compiler and the runtime
- lib: TTMLIR library implementation
  - CAPI: C API for interfacing with the TTMLIR library; note this is needed for implementing the python bindings. Read more about it here: https://mlir.llvm.org/docs/Bindings/Python/#use-the-c-api
  - Dialect: MLIR dialect implementations
- runtime: Device runtime implementation
  - include/tt/runtime: Public headers for the runtime interface
  - lib: Runtime implementation
  - tools/python: Python bindings for the runtime; currently this is where ttrt is implemented
- test: Test suite
- tools/ttmlir-opt: TTMLIR optimizer driver
Namespaces
- mlir: On the compiler side, we use the mlir namespace for all MLIR types and operations, with subnamespaces for our dialects.
  - mlir::tt: Everything ttmlir related is underneath this namespace. Since we need to subnamespace under mlir, mlir::tt seemed better than mlir::ttmlir, which feels redundant.
    - mlir::tt::ttir: The TTIR dialect namespace
    - mlir::tt::ttnn: The TTNN dialect namespace
    - mlir::tt::ttmetal: The TTMetal dialect namespace
    - mlir::tt::ttkernel: The TTKernel dialect namespace
- tt::runtime: On the runtime side, we use the tt::runtime namespace for all runtime types and operations.
  - tt::runtime::ttnn: The TTNN runtime namespace
  - tt::runtime::ttmetal: The TTMetal runtime namespace (not implemented)
Dialects Overview
Here is a brief overview of the dialects in the project; please refer to the individual dialect documentation for more details:
- tt: Common types such as tt.tile, tt.layout, tt.grid, etc., and enums such as data formats, memory spaces, iterator types, etc.
- ttir: A high level dialect that models the tensor compute graph on tenstorrent devices. Accepts tosa and linalg input.
  - ttir.generic: Generically describe compute work.
  - ttir.to_layout: Convert between different tensor memory layouts and transfer between different memory spaces.
  - tensor.pad: Pad a tensor with a value (i.e. convs)
  - ttir.yield: Return the result memref of computation in a dispatch region body; lowers to ttkernel.yield
  - ttir.kernel: Lowers to some backend kernel
- ttnn: A TTNN dialect that models the ttnn API.
- ttkernel: Tenstorrent kernel library operations.
  - ttkernel.noc_async_read
  - ttkernel.noc_async_write
  - ttkernel.cb_push_back
  - ttkernel.[matmul|add|multiply]: Computations on tiles in source register space, storing the result in dest register space.
  - ttkernel.sfpu_*: Computations on tiles in dest register space using the sfpu coprocessor.
- ttmetal: Operations that dispatch work from host to device.
  - ttmetal.dispatch: Dispatch a grid of compute work.
Adding an Op
This guide will walk you through the process of adding a new Op end to end in
tt-mlir; in this case we will be adding a matmul operation. Note that the
matmul op was added as part of the same changeset as this guide, so it could be
useful to reference the diff alongside this guide to see the changes in full.
This guide will cover the following steps:
- Adding an Op
1. Define the Op in the TTIR frontend dialect
We will start by defining the Op in the TTIR dialect. The TTIR Ops are defined
in a tablegen file located at include/ttmlir/Dialect/TTIR/IR/TTIROps.td
.
Tablegen is a domain-specific language for defining ops/types/attributes in MLIR and LLVM; these definitions constitute the dialect's Operation Definition Specification (ODS).
Here is an example of defining matmul
in the TTIR dialect:
def TTIR_MatmulOp : TTIR_DPSOp<"matmul"> {
let summary = "Matrix multiply operation.";
let description = [{
Matrix multiply operation.
}];
let arguments = (ins AnyRankedTensor:$a,
AnyRankedTensor:$b,
AnyRankedTensor:$output,
TT_OperandConstraintArrayAttr:$operand_constraints);
let results = (outs AnyRankedTensor:$result);
let extraClassDeclaration = [{
MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
}];
let hasVerifier = 1;
}
There are many things to break down here, starting from the top:
- def in tablegen is used to define a concrete type. This will have a 1-1 mapping to a C++ generated class, and for this particular case the build will end up generating the file build/include/ttmlir/Dialect/TTIR/IR/TTIROps.h.inc.
- It inherits from class TTIR_DPSOp. Classes in tablegen don't define a concrete type, but rather an interface that augments or constrains inherited defs. TTIR_DPSOp is a class that defines the common attributes for all TTIR Ops that implement Destination Passing Style (DPS) semantics. DPS just means that the result tensor is passed as an argument to the operation, which will be critical for modeling buffer allocation / lifetimes. Note the 3rd argument AnyRankedTensor:$output.
- Next we have a list of arguments. These arguments consist of a mixture of Types (i.e. AnyRankedTensor) and Attributes (i.e. TT_OperandConstraintArrayAttr). Read more about Types & Attributes here.
  - AnyRankedTensor is part of the tablegen standard library, which type aliases to MLIR's builtin Tensor type, with the added constraint that the tensor has a static rank. As much as possible we want to use the builtin types and infrastructure provided by MLIR.
  - TT_OperandConstraintArrayAttr is a custom attribute that we have defined in the TT dialect. This attribute is used to specify constraints on the operands of the operation. For example, the TTIR_MatmulOp requires that the input tensors be in tile layout; this attribute captures this constraint.
- Next we have a list of results, in this case just 1, which aliases the output tensor. One drawback of DPS is that the result tensor and the output tensor will appear to have different SSA names in the IR, but they really alias the same object. This can make writing some passes more cumbersome.
- Next we have extraClassDeclaration, which enables us to inject member functions, written directly in C++, into the generated class. We are doing this here in order to satisfy the DPS interface, which requires an implementation for getting the mutated output tensor.
- Finally, we have hasVerifier = 1. This tells MLIR that we have a verifier function that will be called to validate the operation. This is a good practice to ensure that the IR is well formed.
We can now try building and opening the TTIROps.h.inc
file to see the generated C++ code.
We will actually get a linker error, because hasVerifier = 1 automatically
declared a verifier function, but we still need to implement it.
Let's head over to lib/Dialect/TTIR/IR/TTIROps.cpp
and implement the verifier.
::mlir::LogicalResult mlir::tt::ttir::MatmulOp::verify() {
::mlir::RankedTensorType inputAType = getA().getType();
::mlir::RankedTensorType inputBType = getB().getType();
::mlir::RankedTensorType outputType = getOutput().getType();
auto inputAShape = inputAType.getShape();
auto inputBShape = inputBType.getShape();
auto outputShape = outputType.getShape();
if (inputAShape.size() < 2) {
return emitOpError("Input A must be at least a 2D tensor");
}
if (inputBShape.size() < 2) {
return emitOpError("Input B must be at least a 2D tensor");
}
if (inputAShape.size() != inputBShape.size()) {
return emitOpError("Input A and B must have the same rank");
}
if (inputAShape.size() != outputShape.size()) {
return emitOpError("Input A and B must have the same rank as the output");
}
if (inputAShape[inputAShape.size() - 1] !=
inputBShape[inputBShape.size() - 2]) {
return emitOpError("Input A and B must have matching inner dimensions");
}
if (outputShape[outputShape.size() - 2] !=
inputAShape[inputAShape.size() - 2]) {
return emitOpError("Output must have the same number of rows as input A");
}
if (outputShape[outputShape.size() - 1] !=
inputBShape[inputBShape.size() - 1]) {
return emitOpError(
"Output must have the same number of columns as input B");
}
return success();
}
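The shape rules this verifier enforces can be restated compactly. Here is an equivalent check in Python, purely illustrative and mirroring the C++ above (it is not part of tt-mlir):

```python
# Pythonic restatement of the matmul verifier above: both inputs must be at
# least 2D, ranks must agree, inner dimensions must match, and the output
# must take its rows from A and its columns from B.
def verify_matmul(a_shape, b_shape, out_shape):
    if len(a_shape) < 2 or len(b_shape) < 2:
        return "inputs must be at least 2D"
    if len(a_shape) != len(b_shape) or len(a_shape) != len(out_shape):
        return "ranks must match"
    if a_shape[-1] != b_shape[-2]:
        return "inner dimensions must match"
    if out_shape[-2] != a_shape[-2] or out_shape[-1] != b_shape[-1]:
        return "output shape must be rows(A) x cols(B)"
    return None  # valid
```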
2. Define the Op in the TTNN backend dialect
Next we will define the Op in the TTNN dialect. TTNN Ops are defined in the same way, but in their respective set of dialect files. Refer to the previous section for details; the process is the same.
TTNNOps.td
def TTNN_MatmulOp : TTNN_NamedDPSOp<"matmul"> {
let arguments = (ins AnyRankedTensor:$a,
AnyRankedTensor:$b,
AnyRankedTensor:$output);
let results = (outs AnyRankedTensor:$result);
let extraClassDeclaration = [{
MutableOperandRange getDpsInitsMutable() { return getOutputMutable(); }
}];
let hasVerifier = 1;
}
TTNNOps.cpp
::mlir::LogicalResult mlir::tt::ttnn::MatmulOp::verify() {
::mlir::RankedTensorType inputAType = getA().getType();
::mlir::RankedTensorType inputBType = getB().getType();
::mlir::RankedTensorType outputType = getOutput().getType();
auto inputAShape = inputAType.getShape();
auto inputBShape = inputBType.getShape();
auto outputShape = outputType.getShape();
if (inputAShape.size() < 2) {
return emitOpError("Input A must be at least a 2D tensor");
}
if (inputBShape.size() < 2) {
return emitOpError("Input B must be at least a 2D tensor");
}
if (inputAShape.size() != inputBShape.size()) {
return emitOpError("Input A and B must have the same rank");
}
if (inputAShape.size() != outputShape.size()) {
return emitOpError("Input A and B must have the same rank as the output");
}
if (inputAShape[inputAShape.size() - 1] !=
inputBShape[inputBShape.size() - 2]) {
return emitOpError("Input A and B must have matching inner dimensions");
}
if (outputShape[outputShape.size() - 2] !=
inputAShape[inputAShape.size() - 2]) {
return emitOpError("Output must have the same number of rows as input A");
}
if (outputShape[outputShape.size() - 1] !=
inputBShape[inputBShape.size() - 1]) {
return emitOpError(
"Output must have the same number of columns as input B");
}
return success();
}
3. Convert / Implement the Op in the TTNN passes
Next we will implement the conversion from the TTIR matmul
Op to the TTNN matmul
Op.
This is a trivial conversion, as the Ops are identical in their semantics, so
the changeset isn't going to be very instructive, but will at least point to the
files involved. The conversion is implemented in the ConvertTTIRToTTNNPass
pass in
file lib/Conversion/TTIRToTTNN/TTIRToTTNNPass.cpp
.
Zooming into class ConvertTTIRToTTNNPass
we can see we implement the pass interface
via member function void runOnOperation() final
. This function will be called
for every operation matching the type specified in the pass tablegen file. A
quick look at include/ttmlir/Conversion/Passes.td shows:
def ConvertTTIRToTTNN: Pass<"convert-ttir-to-ttnn", "::mlir::ModuleOp"> {
This means that runOnOperation will be called for every ModuleOp in the
graph; usually there is only one ModuleOp, which serves as the root of the
graph.
Inside runOnOperation is usually where we define a rewrite pattern set that
can match much more complicated patterns (nested inside of the ModuleOp's
regions) than just a single operation. In the runOnOperation method you will
see the call to populateTTIRToTTNNPatterns(...), which actually generates the
rewrite patterns. populateTTIRToTTNNPatterns(...) is defined in
lib/Conversion/TTIRToTTNN/TTIRToTTNN.cpp.
patterns
.add<TensorEmptyConversionPattern,
ToLayoutOpConversionPattern,
ElementwiseOpConversionPattern<ttir::AbsOp, ttnn::AbsOp>,
ElementwiseOpConversionPattern<ttir::AddOp, ttnn::AddOp>,
ElementwiseOpConversionPattern<ttir::SubtractOp, ttnn::SubtractOp>,
ElementwiseOpConversionPattern<ttir::MultiplyOp, ttnn::MultiplyOp>,
ElementwiseOpConversionPattern<ttir::GreaterEqualOp, ttnn::GreaterEqualOp>,
ElementwiseOpConversionPattern<ttir::MaximumOp, ttnn::MaximumOp>,
ElementwiseOpConversionPattern<ttir::NegOp, ttnn::NegOp>,
ElementwiseOpConversionPattern<ttir::ReluOp, ttnn::ReluOp>,
ElementwiseOpConversionPattern<ttir::SqrtOp, ttnn::SqrtOp>,
ElementwiseOpConversionPattern<ttir::SigmoidOp, ttnn::SigmoidOp>,
ElementwiseOpConversionPattern<ttir::ReciprocalOp, ttnn::ReciprocalOp>,
ElementwiseOpConversionPattern<ttir::ExpOp, ttnn::ExpOp>,
ElementwiseOpConversionPattern<ttir::DivOp, ttnn::DivOp>,
ReductionOpConversionPattern<ttir::SumOp, ttnn::SumOp>,
ReductionOpConversionPattern<ttir::MeanOp, ttnn::MeanOp>,
ReductionOpConversionPattern<ttir::MaxOp, ttnn::MaxOp>,
BroadcastOpConversionPattern,
EmbeddingOpConversionPattern,
SoftmaxOpConversionPattern,
TransposeOpConversionPattern,
ConcatOpConversionPattern,
ReshapeOpConversionPattern,
SqueezeOpConversionPattern,
UnsqueezeOpConversionPattern,
ConstantOpConversionPattern,
MatmulOpConversionPattern,
Conv2dOpConversionPattern,
MaxPool2dOpConversionPattern
>(typeConverter, ctx);
More information on rewrite patterns and their capabilities can be found in the MLIR documentation here and here.
For matmul, we defined a new conversion pattern that's generic to all binary ops with arguments named `a` and `b`:
```cpp
class MatmulOpConversionPattern : public OpConversionPattern<ttir::MatmulOp> {
public:
  using OpConversionPattern<ttir::MatmulOp>::OpConversionPattern;

  LogicalResult
  matchAndRewrite(ttir::MatmulOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    rewriter.replaceOpWithNewOp<ttnn::MatmulOp>(
        op, this->getTypeConverter()->convertType(op.getType()), adaptor.getA(),
        adaptor.getB(), adaptor.getOutput());
    return success();
  }
};
```
Invoked as part of the rewrite set: `MatmulOpConversionPattern`.
Note: We also need to add this op to the C++ emitter in `lib/Conversion/TTNNToEmitC/TTNNToEmitC.cpp`, see `populateTTNNToEmitCPatterns(...)`.
4. Add a unit test for the Op
So far we have defined the Op in the TTIR and TTNN dialects,
implemented verifiers, and have conversion passes. Now we need to add a unit
test to ensure that the pass is working correctly. The unit tests are located
in the `test/ttmlir/Dialect` area. In this case we'll add a test under the `TTNN` subdirectory since we are testing the `ConvertTTIRToTTNNPass`.
`test/ttmlir/Dialect/TTNN/simple_matmul.mlir`:

```mlir
// RUN: ttmlir-opt --ttir-load-system-desc --ttir-implicit-device --ttir-layout --convert-ttir-to-ttnn %s | FileCheck %s
#any_device_tile = #tt.operand_constraint<dram|l1|tile|any_device_tile>
// CHECK: #[[TILED_LAYOUT:.*]] = #tt.layout<(d0, d1) -> (d0, d1), undef, <1x1>, memref<2x4x!tt.tile<32x32, bf16>, #dram>, interleaved>
module attributes {} {
  func.func @forward(%arg0: tensor<64x128xbf16>, %arg1: tensor<128x96xbf16>) -> tensor<64x96xbf16> {
    %0 = tensor.empty() : tensor<64x96xbf16>
    // CHECK: %[[C:.*]] = "ttnn.matmul"[[C:.*]]
    %1 = "ttir.matmul"(%arg0, %arg1, %0) <{operand_constraints = [#any_device_tile, #any_device_tile, #any_device_tile]}> : (tensor<64x128xbf16>, tensor<128x96xbf16>, tensor<64x96xbf16>) -> tensor<64x96xbf16>
    return %1 : tensor<64x96xbf16>
  }
}
```
Unit tests in MLIR are typically written using a tool called FileCheck; please refer to the LLVM FileCheck documentation for a tutorial and more information about the `RUN` and `CHECK` directives.
A few things to point out specifically regarding tt-mlir dialects:
- `tt.system_desc`: This is a 1-1 mapping to the `SystemDesc` flatbuffer schema that is used to describe the system configuration. This is a required attribute tagged on the top level module for all tt-mlir dialects.
- Pass `--ttir-layout` is a prerequisite before running `convert-ttir-to-ttnn`. This pass is responsible for converting the input tensors to device memory space and tile layout before lowering to TTNN.
- This test asserts that `ttir.matmul` converts to `ttnn.matmul`.
To run the test, you can use the following command:
```shell
cmake --build build -- check-ttmlir
```
You can also manually run `ttmlir-opt` on the test file to see the resulting output:

```shell
./build/bin/ttmlir-opt --ttir-layout --convert-ttir-to-ttnn test/ttmlir/Dialect/TTNN/simple_matmul.mlir
```
5. Define flatbuffer schema for the Op
Next we will define the flatbuffer schema for the Op. The schema must capture all tensor inputs, outputs, and attributes of the Op, i.e. everything the runtime needs to execute the Op.
`include/ttmlir/Target/TTNN/program.fbs`:

```
table MatmulOp {
  in0: tt.target.TensorRef;
  in1: tt.target.TensorRef;
  out: tt.target.TensorRef;
}
```
Note the type `TensorRef`: flatbuffer tables with the suffix `Ref` are used to represent live values during the runtime, decoupled from the `Desc` suffixed tables which carry the type and attribute information for the object.
We also add this new op to the union `OpType`, which is the variant type for all ops.
More information about writing flatbuffer schemas can be found in the flatbuffers documentation.
6. Serialize the Op in the flatbuffer format
In the previous section we defined the flatbuffer schema for the `matmul` Op; now let's put our new schema definition to use. The schema is used as input to a program called `flatc` which generates C++ code (or code in any other language, for that matter) for serializing and deserializing the schema. This generated code can be found in `build/include/ttmlir/Target/TTNN/program_generated.h`.
Let's head over to `lib/Target/TTNN/TTNNToFlatbuffer.cpp` to define a `createOp` overloaded function that does the conversion from MLIR to flatbuffer:
```cpp
::flatbuffers::Offset<::tt::target::ttnn::MatmulOp>
createOp(FlatbufferObjectCache &cache, MatmulOp op) {
  auto in0 =
      cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getA()));
  auto in1 =
      cache.at<::tt::target::TensorRef>(getOperandThroughDPSOps(op.getB()));
  auto output = cache.at<::tt::target::TensorRef>(
      getOperandThroughDPSOps(op.getResult()));
  return ::tt::target::ttnn::CreateMatmulOp(*cache.fbb, in0, in1, output);
}
```
Lots of things are happening here, let's break it down:
- `FlatbufferObjectCache`: This is a helper class that caches objects created in the flatbuffer during the serialization process. This is necessary for managing value lifetimes and identifiers; at the same time it is an optimization to avoid having multiple copies of the same object. For example, a `TensorRef` with multiple uses could naively be recreated, one for each use, but with the cache we can ensure that the object is only created once and all uses point to the same flatbuffer offset. The cache is passed around to all serialization functions and should be used whenever creating a new object.
- `getOperandThroughDPSOps`: In section 1 we discussed DPS semantics and the drawback of having the result alias the output tensor. This is one of those cases where we need a helper function to trace through the output operands to find the original SSA name in order to associate it with the original `TensorRef`.
- `CreateMatmulOp`: The autogenerated function from the flatbuffer schema that actually serializes the data into the flatbuffer format.
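The role of the cache can be illustrated with a small Python analogue. The class and method names below are hypothetical, not the actual C++ API; the point is the memoization: a map from MLIR values to serialized objects, with a stable `global_id` per value.

```python
# Illustrative sketch only: memoize serialized objects per MLIR value so a
# TensorRef with many uses is created once and every use shares the offset.
class ObjectCache:
    def __init__(self):
        self._offsets = {}
        self._next_id = 0

    def get_or_create(self, value, create_fn):
        # First request for `value` serializes it; later requests reuse it.
        if value not in self._offsets:
            self._offsets[value] = create_fn(self._next_id)
            self._next_id += 1
        return self._offsets[value]

cache = ObjectCache()
a = cache.get_or_create("%arg0", lambda gid: {"global_id": gid})
b = cache.get_or_create("%arg0", lambda gid: {"global_id": gid})
# a is b: the second use reuses the cached object instead of recreating it
```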
We can finally generate a binary with our new Op! We can use the following command:
```shell
./build/bin/ttmlir-opt --ttir-to-ttnn-backend-pipeline test/ttmlir/Dialect/TTNN/simple_matmul.mlir | ./build/bin/ttmlir-translate --ttnn-to-flatbuffer -o out.ttnn
```
And we can inspect the resulting binary with `ttrt`:

```shell
ttrt read out.ttnn
```
7. Add runtime support for the Op
The final step is to add runtime support for the Op by parsing the flatbuffer and invoking the TTNN API. This is implemented in `runtime/lib/ttnn/program.cpp`.
A couple of things to note from above:

- Most runtime op functions will follow a similar pattern: they take in some additional data structures for managing live tensors.
- `liveTensors.at(op->in0()->global_id())`: `global_id` is a unique identifier for the tensor that was generated and managed by the `FlatbufferObjectCache`. This is how it's intended to be used by the runtime.
We can test our changes with `ttrt` (don't forget to rebuild `ttrt`):

```shell
ttrt run out.ttnn
```
Doxygen
This is a link to the Doxygen autogenerated code reference.
Build Instructions
To build the Doxygen reference, use the `doxygen` target in CMake:

```shell
cmake -B build
cmake --build build -- doxygen
```
Specifications
Specifications are documents that define the requirements for features or concepts that are particularly cross-cutting, complex, or require a high degree of coordination and planning. They are intended to be living documents that evolve as the feature is developed and should be maintained as the go-to reference documentation for the feature or concept.
Specifications are written in markdown and are stored in the docs/src/specs
directory of the repository. Below is a template that should be used when
creating a new specification.
Specification Template
```md
# [Title]

A brief description of the feature or concept that this specification is
defining.

## Motivation

A description of why this feature or concept is needed and what problem it is
solving. This section is best written by providing concrete examples and use
cases.

## Proposed Changes

A list of the components that will be impacted by this spec and a detailed
description of the changes that will be made to each respective component.

It should also call out any interactions between components and how they might
share an interface or communicate with each other.

## Test Plan

A brief description of how the feature or concept will be tested.

## Concerns

A list of concerns that have been identified during the design of this feature.
```
Runtime Stitching
Runtime stitching adds the ability for the runtime to stitch together multiple, independently compiled programs at runtime, i.e. without compiler knowledge of how the binary programs will be composed.
Motivation
In order to flexibly support arbitrary training schedules / composing multiple models together we want to have the ability for the runtime to stitch graphs together. To achieve this we need to define an ABI kind of interface between the compiler and the runtime.
Simple Example
```python
mod_a = forge.compile(PyTorch_module_a)
mod_b = forge.compile(PyTorch_module_b)

for i in range(10):
    outs_a = mod_a(ins_a)
    outs_b = mod_b(outs_a)
```
`mod_a` and `mod_b` are 2 independent compile steps; during the compile step for `mod_a` it should be completely unaware that `mod_b` will take place, and vice-versa.
In order to achieve this we propose a new runtime concept called stitching:

- forge invokes the compile step for `mod_a`; the tt-mlir compiler determines where the inputs (`ins_a`) should live: host, device DRAM, or device L1. tt-mlir returns metadata to forge describing where it wants the tensors to reside before invoking flatbuffer submission.
- forge invokes the compile step for `mod_b`; the same happens as in the previous bullet.
- `mod_a` is invoked at runtime; the forge runtime needs to inspect the compiler metadata to determine where the tensors should live. The runtime manually invokes a new data copy command to get the tensors to the correct memory space / correct memory address.
- The forge runtime invokes the `mod_a` program submit.
- `mod_b` is invoked at runtime; this time it might be that the compiler left the tensor outputs in L1, so no data copy is needed to start running `mod_b` since the inputs are already in the correct location.
A more concrete use case would be a training loop, where there are often multiple graphs composed together (#82), or eventual torch 2.0 support, where the torch runtime can arbitrarily break the graph anywhere.
Proposed Changes
Compiler Metadata
The compiler will encode the input tensor layout information directly into the flatbuffer tensor desc. The flatbuffer schema already exists to express this; we just need to start populating it instead of assuming a canonical host layout.
Compiler will decide where the tensors should live, host, device dram, device l1.
Runtime
- Runtime will inspect the tensor desc metadata to determine where the tensors need to end up / what layout they should be in before invoking the program.
- New runtime API: `Tensor toLayout(Tensor tensor, ::tt::target::TensorDesc* tensorDesc);`
- Runtime will need to invoke `toLayout` on all input tensors before invoking the program.
Test Plan
- Add a new test to the runtime gtest suite that verifies the runtime can correctly stitch together 2 independently compiled programs.
Concerns
- Tensors that pass through device memory spaces (DRAM, L1) will have a dynamic address; some arbitrary run order of flatbuffers could cause tensors to end up in non-ideal locations in memory. Specifically in L1, a poorly placed tensor might not be able to be moved to a better location without a bounce through DRAM.
Tensor Layout
The tensor layout attribute captures how tensor data is sharded across a grid of devices, cores, and is laid out in memory.
Motivation / High level goals
- Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
- Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
- Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
- Explicit: A single source of truth.
- Enable a direct way to query padded regions.
An Example / Walkthrough
Let's consider a snippet of MLIR:
tensor<2x3x64x128xf32>
Here we've defined a 4 dimensional tensor using MLIR's builtin tensor type. This tensor type has an optional attribute called an Encoding, this attribute has been used by the TT dialect to encode the tensor's layout. This looks like:
```mlir
tensor<2x3x64x128xf32,
  #tt.layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <1x1>,
    memref<384x128xf32, #tt.memory_space<l1>>
  >
>
```
At the time of this writing there are 4 properties that make up a tensor layout:

- `linear`: An affine map that defines how the logical tensor dimensions map to a grid shape. Note that the number of dims in the affine map must match exactly the rank of the original tensor, and the number of results must match exactly the rank of the grid shape.
- `oob_val`: A tracked out of bounds value that fills padding space.
- `grid`: The grid shape that this tensor is divided onto.
- `memref`: A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to the grid.
This example isn't particularly complicated because it's only sharded to a 1x1 grid, the rest of the document will go into more details on the following topics:
Before we jump into more advanced topics there are two resources that could be useful to have at hand:
- `test/python/tensor_layout.py`: Python test with many convenience functions for creating and experimenting with tensor layouts.
- TTNN Interactive Visualizer: An interactive visualization tool that demonstrates the transformation. Note that this tool was created for the TTNN tensor layout, but many of the same concepts transfer over.
Dimension Collapsing
Probably the most important concept in tt.layout
is dimension collapsing.
This is captured by the affine map linear
property which provides a
mapping from tensor dim space to a reduced physical dimensional space. This
single-handedly touches on most of the tensor layout goals mentioned at the
beginning of the doc:
- Leaves tensor shapes intact
- Logical-Physical mapping, how the tensor is laid out in memory over a grid
- Enables more flexible sharding
- Explicit padding
To see how these goals are achieved we'll continue working on an explicit example, same one as above:
(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3)
To recap, we have our example 4d tensor (2, 3, 64, 128)
, which maps directly to the
LHS (d0, d1, d2, d3)
. We have our 2d grid shape (1, 1)
, notice the
affine-map RHS is also 2d, and this describes how tensor dims map to a lower dimensional
physical memory, overlaid on a grid. We'll see how this gets divided onto the grid later,
but first let's look at how this forms an affine-map iteration space. If we
index our tensor at say [1, 1, 6, 100]
, we can simply plugin those numbers
to get our remapped offset:
(1 * 192 + 1 * 64 + 6, 100) = (262, 100)
This remapped offset (262, 100)
corresponds to the row and column index of the
collapsed physical memory.
By default, the dim range `[0, -1)` is collapsed, but the `tt.layout` constructor can actually take a programmable range called `collapseIntervals`. `collapseIntervals` is a list of pairs, where each pair is a dim range interval, left inclusive, right exclusive. Let's consider a few examples:
Instead of multiplying out real shapes, we will use `<>` to represent a dimension join operator.

- 3D tensor onto a 2D grid with the default `collapseIntervals=[(0, -1)]`: `(d0, d1, d2) -> (d0 <> d1, d2)`
- 4D tensor onto a 3D grid with `collapseIntervals=[(1, -1)]`: `(d0, d1, d2, d3) -> (d0, d1 <> d2, d3)`
- 4D tensor onto a 3D grid with `collapseIntervals=[(0, 2)]`: `(d0, d1, d2, d3) -> (d0 <> d1, d2, d3)`
- 7D tensor onto a 4D grid with `collapseIntervals=[(0, 3), (-3, -1)]`: `(d0, d1, d2, d3, d4, d5, d6) -> (d0 <> d1 <> d2, d3, d4 <> d5, d6)`
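A small Python sketch (a hypothetical helper, not part of tt-mlir) that joins dims the same way can make the interval semantics concrete:

```python
def collapse_shape(shape, intervals):
    """Collapse dims of `shape` per left-inclusive, right-exclusive intervals.
    Negative bounds wrap like Python indices; dims outside any interval pass
    through unchanged."""
    n = len(shape)
    norm = [(a % n if a < 0 else a, b % n if b < 0 else b)
            for a, b in intervals]
    out, i = [], 0
    while i < n:
        match = next(((a, b) for a, b in norm if a == i), None)
        if match:
            a, b = match
            joined = 1
            for d in shape[a:b]:
                joined *= d  # the <> join operator is just a product of dims
            out.append(joined)
            i = b
        else:
            out.append(shape[i])
            i += 1
    return out

collapse_shape([2, 3, 64, 128], [(0, -1)])  # [384, 128]
```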
Multi-core
Let's consider the original example again, but on a larger grid than 1x1
, say 2x4
:
```mlir
tensor<2x3x64x128xf32,
  #tt.layout<
    (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
    undef,
    <2x4>,
    memref<192x32xf32, #tt.memory_space<l1>>
  >
>
```
The number of affine map results, the grid shape, and the memref shape all must have the same rank. We can see in this example that by changing the grid shape we also changed the memref shape; we can always calculate the memref shape by plugging the full tensor dims into our affine map and then dividing by the grid shape.
(d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3),
(2 - 1, 3 - 1, 64 - 1, 128 - 1) = (1 * 192 + 2 * 64 + 63, 127) = (383, 127)
Above we actually subtracted 1 in order to get the index of the last element of the tensor. Now we can simply add back 1 to get the size:
(383 + 1, 127 + 1) = (384, 128)
Finally, we divide the dims by the respective grid dims:
(384 / 2, 128 / 4) = (192, 32)
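The same recipe can be written as a short sketch (the helper name is illustrative). Note it assumes the grid evenly divides the physical shape; the padded case is covered later in this document:

```python
def shard_shape(tensor_shape, linear, grid):
    # Plug the index of the last element into the affine map, add 1 to recover
    # the collapsed physical size, then divide elementwise by the grid dims.
    last = [d - 1 for d in tensor_shape]
    physical = [r + 1 for r in linear(*last)]
    assert all(p % g == 0 for p, g in zip(physical, grid)), "would need padding"
    return [p // g for p, g in zip(physical, grid)]

# The example map: (d0, d1, d2, d3) -> (d0 * 192 + d1 * 64 + d2, d3)
linear = lambda d0, d1, d2, d3: (d0 * 192 + d1 * 64 + d2, d3)
shard = shard_shape([2, 3, 64, 128], linear, [2, 4])  # [192, 32]
```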
Here are a few more example MLIR snippets:

```mlir
tensor<8x300xf32,
  #tt.layout<(d0, d1) -> (d0, d1),
    undef,
    <1x2>,
    memref<8x150xf32, #tt.memory_space<l1>>
  >
>
```

```mlir
tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
    undef,
    <2x1>,
    memref<384x32xf32, #tt.memory_space<l1>>
  >
>
```

```mlir
tensor<8x96x32xf32,
  #tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
    undef,
    <2x1x2>,
    memref<384x96x16xf32, #tt.memory_space<l1>>
  >
>
```

```mlir
tensor<5x3x2x2x7x32x32xf32,
  #tt.layout<
    (d0, d1, d2, d3, d4, d5, d6)
      -> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
    undef,
    <3x2x2x2>,
    memref<4480x4x16x16xf32, #tt.memory_space<l1>>
  >
>
```
A couple of final notes regarding grid shape:

- Grid shapes of rank > 2 are perfectly legal. Not only is this useful for describing multi-device grid topologies, but it is often convenient to have higher ranked grids to better describe how a high rank tensor should be divided. The grid shape here is a virtual grid shape; the `tt.device` attribute will hold an additional affine map that defines how this virtual grid shape maps to a physical one.
- Grid shapes where either columns or rows exceed the physical device grid are also legal. Since this is only a virtual grid shape, we could have some grid `1x64` that maps to a physical `8x8` device grid (this particular example is called width sharding in TTNN).
Tilized
A tilized tensor is one with a memref that has a tile element type.
Given some tensor with scalar layout:
```mlir
tensor<3x64x128xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<64x64xf32, #tt.memory_space<l1>>
  >
>
```
After tilizing we'll have:
```mlir
tensor<3x64x128xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 64 + d1, d2),
    undef,
    <3x2>,
    memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>
```
Notice the `memref` dims were `ceilDiv`'d by the tile shape and the element type became a `tt.tile` type. Also notice that the tensor shape and element type remain intact.
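The shard-to-tile arithmetic is just a ceiling division per dim; for example, the `64x64` scalar shard above becomes `2x2` tiles:

```python
def ceildiv(a, b):
    # Ceiling division on positive ints without floating point.
    return -(-a // b)

def tilize_shard(shard_shape, tile_shape=(32, 32)):
    # memref dims are ceilDiv'd by the tile shape; the element type becomes
    # a tile, but the logical tensor shape is untouched.
    return [ceildiv(s, t) for s, t in zip(shard_shape, tile_shape)]

tilize_shard([64, 64])  # [2, 2]
```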
Padding
Padding can be a bit of an overloaded term, but in this context it refers to an out of bounds area in the physical memory allocation that has no real tensor data in it. The contents of this area are tracked by `oob_val`, and the padding area can be automatically derived from the attributes of `tt.layout`.
Padding is a necessary evil that arises when a tensor is not evenly divisible by a grid shape or tile shape. It can also arise due to minimum NoC addressing requirements.
Example of non-divisible grid:
```mlir
tensor<53x63xf32,
  #tt.layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<18x32xf32, #tt.memory_space<l1>>
  >
>
```
The grid dims always `ceilDiv` the affine map results; real tensor data will entirely fill the initial shards, and the last shard in each dimension will be partially filled. In this particular example, we have 1 scalar row of padding on the last row of cores and 1 scalar column of padding on the last column of cores.
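The shard shape and padding amounts fall out of two lines of arithmetic (helper names are illustrative):

```python
def ceildiv(a, b):
    return -(-a // b)

def grid_padding(physical_shape, grid):
    """Return (shard_shape, padding) for a shape divided onto a grid.
    padding[i] is how many scalar rows/cols of the last shards are unused."""
    shard = [ceildiv(p, g) for p, g in zip(physical_shape, grid)]
    pad = [s * g - p for s, g, p in zip(shard, grid, physical_shape)]
    return shard, pad

grid_padding([53, 63], [3, 2])  # ([18, 32], [1, 1])
```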
Taking the above example a step further, we could tilize it:
```mlir
tensor<53x63xf32,
  #tt.layout<
    (d0, d1) -> (d0, d1),
    undef,
    <3x2>,
    memref<1x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>
```
Tile dims also always `ceilDiv` the resulting `memref` shape. Notice now that the padding is slightly more complicated. Our scalar shard shape was `18x32`, but this was further padded to `32x32`, meaning that every core now has 14 rows of padding except for the last row of cores, which has 15 rows of padding.
Also note that there is an order of operations here, grid divides the scalar shape first and then we tilize. This is important because it can enable use cases that frequently arise in conv networks that would otherwise result in reshards in between every layer.
With the affine map we can be even more flexible in how we pad: we can bump our stride between dimensions. Consider a tensor (with batch dim `2`):
```mlir
tensor<2x8x32xf32,
  #tt.layout<
    (d0, d1, d2) -> (d0 * 8 + d1, d2),
    undef,
    <1x2>,
    memref<16x16xf32, #tt.memory_space<l1>>
  >
>
```
If we tilized the above tensor we'd end up with a memref
shape of
1x1x!tt.tile<32x32>
, that is, all batches are tightly packed within a single
tile. Let's say that for some reason, we do not want the batches (2
) to be
tightly packed within a tile, perhaps the mathematical operation we're doing
requires the batch to be independently evaluated and thus the (S)FPU needs them
in separate tiles. We can achieve this by bumping the stride of the affine map:
(d0, d1, d2) -> (d0 * 32 + d1, d2),
Instead of striding by the number of logical rows, 8
, we bump the stride up to
32
effectively pushing a gap between the collapsed rows and enabling each
batch to fall on a tile boundary.
Memory Spaces
At the time of writing this document there are 4 memory spaces:

- `System`: Host memory space that is not device visible.
- `SystemMMIO`: Host memory space that is device visible.
- `DeviceDRAM`: DRAM local to the device.
- `DeviceL1`: SRAM on each core.
Something worth noting here is that a tensor must belong exclusively to only
one of these memory spaces at a time. For example, in order to stream tensor
data from DeviceDRAM
to DeviceL1
you would need to either manually slice the
tensor into smaller tensors that do fit in L1 or have native support in the op's
kernel for double buffering a block (most TTNN ops already support this).
Multi-device
Multi-device can be naturally represented via a combination of two concepts
already touched on above, higher ranked grids and collapseIntervals
. Let's
consider the following example with a 3d grid and collapseIntervals=[(1, -1)]
.
```mlir
tensor<2x3x64x128xf32,
  #tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
    undef,
    <2x2x4>,
    memref<1x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
  >
>
```
Here we've left the batch dim intact and started collapsing
at d1
. This enables us to define a 3d grid where the outermost grid dim
divides the batch directly. This could map to a 2 device system where the batch
dim is evenly divided between 2 devices. Within each device this op runs on a
2x4
grid.
The high level takeaway here is that how a tensor is logically divided up is decoupled from its mapping to physical compute resources. This has a nice property that data parallel extends to any tensor dimension and is captured under the same grid primitive that also divides tensor rows and columns.
Test Plan
- `test/python/tensor_layout.py`: Assertions for `LayoutAttr` to make sure it's spec compliant.
- Sweep tests:
  - Grid dim sweeps
  - Tilize / untilize sweeps
  - Padding sweeps
- Multi-device tests
Concerns
- `tt.layout` is deliberately flexible and tries to capture as many problematic use cases we've run into in the past in a single, succinct representation. This flexibility will need to be further constrained by backends to avoid unsupported programming of this attribute.
- The optimization solution space is potentially large with all of this flexibility. Two things that I hope can help protect us here:
  - By and large, the heuristic we'll be following is just to max the grid at all costs. This should really narrow down the solution space to only a handful of options, and we only keep exploring if producers/consumers end up with nasty reblocking.
  - We can constrain the optimizer heuristics as aggressively as possible in the beginning and just advertise the full flexible options to the UI model explorer. Hopefully this enables us to experiment with crazier grid layouts and prove they're worthwhile before writing an algorithm.
TTNN Tensor Layout
The above section of this document covers how the compiler models tensor layout. There are some slight differences in TTNN, but the high level idea of collapsing dims is still used.
Terms
- `shape`: Always the logical shape, n-dimensional.
- `stride`: Same as PyTorch stride, but this is crucial for describing how n-dimensional data gets packed into a 2D physical layout. This 2D physical layout is always the inner dim (-1) wide, and dims [0, N-1] are collapsed into rows derived from stride.
- `shard_shape`: Also a logical shape, describes a 2D region that chunks `physical_shape`. Note this does not need to be a tile multiple.
- `physical_shard_shape`: The `shard_shape` padded out to `tile_shape`.
- `tile_shape`: A programmable tile shape, though constraints must check that it's compatible with an op's usage, i.e. FPU/NoC compatible.
- `grid_shape`: `[divup(stride[0] // stride[-2], shard_shape[0]), divup(stride[-2], shard_shape[0])]`
Mapping from the compiler
The compiler uses an affine map to explicitly track which dimensions are folded
together, but TTNN does not have affine maps so the representation is a bit more
implicit. TTNN captures the dimension collapsing in the stride
attribute
where dimensions [0, N-1]
are always collapsed. This is less flexible so the
compiler will have to enforce only collapsing supported dimensions when
targeting TTNN, or handle lowering in a different way. For example, in the
compiler we might want to represent data parallel over the tensor batch dim by
leaving d0
and collapsing d1 - d[-1]
. TTNN doesn't support this in its
tensor layout representation, but this could be lowered to a TTNN mesh tensor
where the mesh could be sliced on the batch and each per-device tensor has d0
fully collapsed.
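For a contiguous tensor, the stride and the collapsed 2D shape described above can be computed as follows (a sketch of the convention, not TTNN code):

```python
import math

def row_major_strides(shape):
    # Contiguous (PyTorch-style) strides: innermost dim has stride 1.
    strides, acc = [], 1
    for d in reversed(shape):
        strides.append(acc)
        acc *= d
    return list(reversed(strides))

def collapsed_2d_shape(shape):
    # Dims [0, N-1] collapse into rows; the innermost dim is the row width.
    return (math.prod(shape[:-1]), shape[-1])

row_major_strides([2, 3, 64, 128])   # [24576, 8192, 128, 1]
collapsed_2d_shape([2, 3, 64, 128])  # (384, 128)
```

This matches the compiler-side example earlier in the document, where `2x3x64x128` collapses to `384x128`.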
TTNN Example
TTNN Interactive Visualizer
Device
Device in tt-mlir is somewhat of an overloaded term and can refer to different
things depending on the context. This document will only speak to the compiler's
abstract representation of a device captured by attribute #tt.device
.
Terms
There are many overloaded terms when talking about devices and grids, this document will use the following definitions:
- Physical Grid: A 2D array of tensix cores on a chip.
- Chip: A single physical chip with a Physical Grid of cores.
- Card: A PCIE or Ethernet card that may contain multiple Chips.
- System: A collection of Cards that are usually connected together on the same host via PCIE or networked via ethernet. A system is represented by `SystemDesc` in the compiler.
- Device: Device is always presented as a single entity to the enclosing scope, but it may be virtualized to abstract a multi-card System, and part of its encoding carries a Logical Grid. Another way to think of a device is as a view over the system.
- Logical Grid or just Grid: A logical shape that abstracts one or more Physical Grids.
- Mesh Shape: Describes the virtual layout of the chips with respect to each other. In practice the mesh shape is used to derive the logical grid.
Motivation
The device attribute strives to achieve the following goals:
- Provide a convenient representation of a physical grid that decouples the logical division of tensors from the physical layout of the hardware. This not only simplifies reasoning about how tensors get divided into shards, but can also enable reinterpretations of the device grid for data layout optimization decoupled from the existing encoding of the tensor layouts.
- Following the first point, the device attribute should be able to represent many different forms of logical grids, from simple 2D grids, to more complex topologies like extra-wide grids or higher dimensional grids.
- Device attribute captures encoding both single chip and multi-chip systems under a single, virtualized representation.
- Enable many forms of data parallel execution strategies for single and multi chip systems under a single representation.
Scope
This document will cover how the device attribute is encoded and how it can be lowered to backend dialects. The document will not cover the algorithm for choosing the best, or even legal, device configurations for a given physical system.
Examples
All of the following examples will assume the physical hardware has an 8x8 physical
grid of cores. We will use notation [N, 8x8]
to represent a N
chip system,
each with an 8x8 physical grid.
`#tt.device`, in its simplest single-chip form `[1, 8x8]`, just maps directly 1-1 to the underlying physical hardware device.
```mlir
#tt.device<
  workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
  meshShape = 1,
  chipIds = [0]
>
```
Let's break down what each of these attributes mean:
- `workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>`: This is a 2D logical grid with dim 8x8. It's followed by an affine map `(d0, d1) -> (0, d0, d1)` that provides a mapping from the logical grid to the physical grid. In this case the logical grid is the same as the physical grid, so the mapping is the identity function. The logical grid can have any rank, but the physical mapping is always 3D, with the first result being the chip index, followed by the 2D physical core index within the chip.
- `meshShape = 1`: A shape provided as part of the `DeviceAttr` constructor that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. an 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
- `chipIds = [0]`: This is a list of chip indices. These chip indices directly reference the same chip indices in the system descriptor. The `SystemDesc` attribute that this is in reference to is tagged on the top level `ModuleOp`.
Specific examples that this document will cover:
- Data Parallel Over Batch
- Data Parallel Over 2d
- Data Parallel Over 2d and Batch
- Pipeline Parallel
- Reinterpreted Grids (Transpose)
- Reinterpreted Grids (Training Usecase)
- Reinterpreted Grids (Extra)
Before we move on to more complex examples, it's worth having on hand:
- The python test `test/python/device_attr.py`, which shows how all of these examples can actually be programmed for the device attribute.
- The Tensor Layout spec, as the following examples will demonstrate how tensor layout interacts with the logical device grid.
Note on Data Parallel: There is existing literature that explicitly distinguishes between data parallel and tensor parallel, oftentimes describing data parallel as duplicating the model across multiple devices and trivially dividing up the batch whereas tensor parallel refers to tensor data being distributed and potentially communicated between devices during execution. While this is true for multi-GPU/CPU systems, it is somewhat of an implementation detail and given the flexibility of tenstorrent hardware there is an opportunity to generalize this concept. In this document we will use the term data parallel to refer to any form of parallelism that divides any dimension of the tensor across multiple cores/chips.
Note on Constraints: Many of the examples below require careful virtualization of the underlying physical system, i.e. some device configurations might only work if the chips are connected via ethernet and with a particular topology, but these constraints are outside the scope of the examples and will be discussed further in the Backend Lowering and Constraints section.
Data Parallel Over Batch
Given a 2 chip system, [2, 8x8]
, we can represent a simple data parallel
logical grid that divides the batch dimension in half across the two chips.
This is denoted by meshShape = 2x1x1
which means the logical grid is 3D.
#tt.device<
workerGrid = #tt.grid<2x8x8, (d0, d1, d2) -> (d0, d1, d2)>,
meshShape = 2x1x1,
chipIds = [0, 1]
>
The affine map here is just identity, so dims d1
and d2
directly index
the physical grid and d0
indexes the chip.
Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:
tensor<16x3x64x128xf32,
#tt.layout<(d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3),
undef,
<2x2x4>,
memref<8x3x1x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
>
>
If we map this tensor onto the above device, it will span across both chips, half of the batch dimension on each chip. Within each chip the tensor occupies a 2x4 grid out of the 8x8 physical grid available.
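To make the arithmetic concrete, here's a small Python sketch that derives the per-core shard memref shape from the tensor shape, the layout's linear map, and the grid. The names `shard_shape` and `lin` are illustrative, not part of the project, and the sketch assumes the linear map is monotone in each dimension and that the innermost two dims are stored as 32x32 tiles:

```python
from math import ceil

def shard_shape(tensor_shape, linear_map, grid, tile=(32, 32)):
    # Collapse the logical shape through the linear map (evaluating it at
    # the max index works because the map is monotone per dimension), then
    # divide across the grid and tile the innermost two dims.
    phys = [d + 1 for d in linear_map(*[d - 1 for d in tensor_shape])]
    shard = [ceil(p / g) for p, g in zip(phys, grid)]
    shard[-2] = ceil(shard[-2] / tile[0])
    shard[-1] = ceil(shard[-1] / tile[1])
    return shard

# The linear map from the example: (d0, d1, d2, d3) -> (d0, d1 * 64 + d2, d3)
lin = lambda d0, d1, d2, d3: (d0, d1 * 64 + d2, d3)
print(shard_shape((16, 3, 64, 128), lin, (2, 2, 4)))  # [8, 3, 1]
```

The result, [8, 3, 1] tiles, matches the memref<8x3x1x!tt.tile<32 x 32, bfp_bf8>> in the example above.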
Data Parallel Over 2d
In this example we will consider a 2 chip system, [2, 8x8]
, and view it as
though the two chips are concatenated together side by side to form a single
8x16
grid. This is denoted by meshShape = 1x2
which means to concatenate
the chips in the second dimension.
#tt.device<
workerGrid = #tt.grid<8x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8)>,
meshShape = 1x2,
chipIds = [0, 1]
>
Here we can see that the affine map encodes an indexing pattern such that when we extend past 8 cores in the second dimension, we wrap around to the next chip.
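The wrap-around behavior can be checked with a few lines of Python (`worker_map` is an illustrative name, not project code):

```python
# meshShape = 1x2 map: (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0, d1 mod 8)
def worker_map(d0, d1):
    chip = (d0 // 8) * 2 + d1 // 8  # d0 < 8 here, so the first term is always 0
    return (chip, d0, d1 % 8)

# Columns 0-7 land on chip 0; columns 8-15 wrap around to chip 1.
assert worker_map(3, 4) == (0, 3, 4)
assert worker_map(3, 12) == (1, 3, 4)
```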
Now we can consider some tensor that, importantly, has a grid of the same rank as the logical device grid:
tensor<256x1024xf32,
#tt.layout<(d0, d1) -> (d0, d1),
undef,
<4x16>,
memref<2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
>
>
This single tensor maps trivially onto the logical grid, spanning the upper half. Decoupled from the tensor's layout, under the hood the tensor is actually physically spanning across two chips.
Data Parallel Over 2d and Batch
The previous 2 examples can be composed together to form a logical grid that
divides the tensor across multiple dimensions. Here we will consider a 4 chip
system [4, 8x8]
and view it as a 2x8x16
grid. Note that the meshShape
is
2x1x2
which means to concatenate the chips in the first and third dimensions.
#tt.device<
workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
meshShape = 2x1x2,
chipIds = [0, 1, 2, 3]
>
We can evaluate the affine map to see that the chips are interpreted in chunks of
two, where groups [0, 1]
and [2, 3]
each form 8x16 grids and these 2 groups
concatenate to form a 2x8x16 grid.
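Evaluating the map in Python confirms the chip grouping (`worker_map` is an illustrative name, not project code):

```python
# meshShape = 2x1x2 map for the [4, 8x8] system:
# (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)
def worker_map(d0, d1, d2):
    chip = d0 * 2 + (d1 // 8) * 2 + d2 // 8
    return (chip, d1, d2 % 8)

# d0 selects the chip pair and d2 >= 8 wraps to the second chip of the pair,
# so group [0, 1] forms one 8x16 grid and group [2, 3] the other.
assert {worker_map(0, r, c)[0] for r in range(8) for c in range(16)} == {0, 1}
assert {worker_map(1, r, c)[0] for r in range(8) for c in range(16)} == {2, 3}
```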
We can consider the following tensor to map onto this grid:
tensor<64x256x1024xf32,
#tt.layout<(d0, d1, d2) -> (d0, d1, d2),
undef,
<2x4x16>,
memref<32x2x2x!tt.tile<32 x 32, bfp_bf8>, #tt.memory_space<l1>>
>
>
Pipeline Parallel
Pipeline parallelism in the scope of this spec isn't particularly interesting; it is intended to be used in conjunction with the ttir.pipeline operation, which will group sections of the module's operations into pipeline regions, and will be covered in a separate spec.
What we can demonstrate here is how we can take multiple non-overlapping views of the system descriptor to form distinct virtual devices.
Given an 8 chip system [8, 8x8]
, we can form two virtual devices that each
take 4 chips and interpret them differently (though they could take the same
logical grid).
#tt.device<
workerGrid = #tt.grid<2x8x16, (d0, d1, d2) -> (d0 * 2 + (d1 floordiv 8) * 2 + d2 floordiv 8, d1, d2 mod 8)>,
meshShape = 2x1x2,
chipIds = [0, 1, 2, 3]
>
#tt.device<
workerGrid = #tt.grid<16x16, (d0, d1) -> ((d0 floordiv 8) * 2 + d1 floordiv 8, d0 mod 8, d1 mod 8)>,
meshShape = 2x2,
chipIds = [4, 5, 6, 7]
>
Reinterpreted Grids (Transpose)
One particularly interesting use case that logical grids enable is reinterpreting the grid as a form of data layout optimization. For example, to transpose a tensor, instead of moving the data around to implement transpose, we can reinterpret the grid as being transposed, leveraging the fact that the relevant data is already located on the correct cores/chips.
To keep things simple, let's consider a 1 chip system [1, 8x8]
, but it's not
too big a leap to see how this could map to multi-chip where the cost of moving
data is even higher.
Let's also consider a simple (totally contrived) eltwise unary graph:
a = exp(a)
aT = transpose(a)
relu(aT)
- We'll establish a regular, single chip, identity logical grid:
#tt.device<
workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, d1)>,
meshShape = 1,
chipIds = [0]
>
- Execute exp.
- We'll reinterpret the grid as transposed:
#tt.device<
workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d1, d0)>,
meshShape = 1,
chipIds = [0]
>
- Execute transpose. Note that each core only needs to transpose its data locally. Eventually this could be implemented as a no-op by reindexing the tile visitation order of the successive operation.
- Execute relu.
It's important to note that we effectively implemented transpose without moving data anywhere.
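The no-data-movement property can be verified directly: under the transposed map, every logical coordinate (i, j) resolves to the physical core that already holds shard (j, i) under the original identity map (the lambdas below are illustrative, not project code):

```python
identity = lambda d0, d1: (0, d0, d1)    # original workerGrid map
transposed = lambda d0, d1: (0, d1, d0)  # reinterpreted workerGrid map

# Logical (i, j) under the transposed grid lands on the core that holds
# (j, i) under the identity grid, so no shard ever has to move.
assert all(transposed(i, j) == identity(j, i)
           for i in range(8) for j in range(8))
```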
Reinterpreted Grids (Extra)
For the sake of examples, here are a few more ways of reinterpreting the logical grid.
Extra Wide Grid
#tt.device<
workerGrid = #tt.grid<1x64, (d0, d1) -> (0, d0 * 8 + d1 floordiv 8, d1 mod 8)>,
meshShape = 1,
chipIds = [0]
>
Extra Tall + Transposed Grid
#tt.device<
workerGrid = #tt.grid<64x1, (d0, d1) -> (0, d1 * 8 + d0 floordiv 8, d0 mod 8)>,
meshShape = 1,
chipIds = [0]
>
Staircase
#tt.device<
workerGrid = #tt.grid<8x8, (d0, d1) -> (0, d0, (d0 + d1) mod 8)>,
meshShape = 1,
chipIds = [0]
>
This could be an interesting starting position for data in implementing matmul as a systolic array in a ring topology.
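A quick sketch of the skew this map produces (illustrative Python, not project code):

```python
# Staircase map: (d0, d1) -> (0, d0, (d0 + d1) mod 8)
staircase = lambda d0, d1: (0, d0, (d0 + d1) % 8)

# Logical column 0 shifts one physical column to the right per row,
# tracing the diagonal that gives the grid its staircase shape.
print([staircase(r, 0)[2] for r in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7]
```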
Lowering to TTNN
While the above device attribute encoding is quite flexible, this does not necessarily mean the target backend can actually support all of these interpretations. The TTNN backend will be constrained to support only the specialized grid topologies that are supported by its API.
Grid/Shard Orientation
TODO
Multi-device
Please refer to the TTNN Mesh Programming Docs for more information on how to program multi-device systems with the TTNN API.
The multi-device TTNN dialect will try to stay as close to the TTNN API as possible. Let's consider what this looks like from the compiler and runtime perspectives:
Compiler
- Device Creation: The TTNN device in the compiler is exactly the same attribute from the ttir dialect. It will encode the meshShape into the flatbuffer, which can be directly used to program ::ttnn::MeshShape.
- Tensor Layout: Again, the tensor layout is inherited in the TTNN dialect from the ttir dialect. The grid attribute in the tensor layout can be trivially divided by meshShape to determine the shape of the tensor slice on each device. Broadcasting rules can be applied to determine which Distribution Strategy to use:
  - Mesh Sharded: If the tensor grid is > 1 along the meshShape dimensions, the tensor will be sharded across the mesh devices.
  - Replication: If the tensor needs to be broadcasted for this op, by extension the tensor layout will be replicated across the mesh devices.
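The grid division and strategy choice described above can be sketched in a few lines of Python (a simplification under the stated rules; `distribution` is an illustrative name, not compiler code):

```python
def distribution(tensor_grid, mesh_shape):
    # Divide the layout grid by the mesh shape to get the per-device slice.
    per_device = [t // m for t, m in zip(tensor_grid, mesh_shape)]
    # Sharded if the tensor grid spans more than one device along any mesh dim.
    sharded = any(m > 1 and t > 1 for t, m in zip(tensor_grid, mesh_shape))
    return ("mesh_sharded" if sharded else "replicated", per_device)

# The 2x4x16 tensor grid from the earlier example on a 2x1x2 mesh is
# sharded, with a 1x4x8 grid slice per device.
print(distribution((2, 4, 16), (2, 1, 2)))  # ('mesh_sharded', [1, 4, 8])
```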
Runtime
- Device Creation: The ttnn runtime will wholesale switch to working with mesh devices via the API ttnn::multi_device::open_mesh_device; this is possible because a 1x1 mesh device is a valid single device. The mesh shape during device open will always be 1xN where N is the number of device ids in the array. Note that this shape can be reinterpreted by flatbuffer programs on the fly with the SubMesh API.
- Tensor Creation: Tensor creation in a multi-device system is a bit more involved. In order to upload a multi-device tensor to the mesh, the host tensor must first be created with MultiDeviceHostStorage. The ttnn runtime can automatically do this during handleToHostMemoryConfigOp:
  - A regular host tensor will bounce through a new tensor with MultiDeviceHostStorage type.
  - tensor.to(mesh_device) will allocate/move the tensor to the mesh device.
Lowering to TTMetal
In TTMetal dialect we are only constrained by what we've implemented in the tt-mlir compiler, this means it is much more flexible and can theoretically support any of the grid interpretations above.
Test Plan
- test/python/device_attr.py covers all of the examples above and asserts the IR is correctly generated.
- Additional functional unit tests will be added as op and runtime support is added.
Concerns
tt.device
is very flexible, but with this flexibility comes the potential for misuse. It's important that the compiler is able to validate the legal configurations of this attribute for the target backend.
'tt' Dialect
This dialect defines types and attributes common to all TT dialects.
[TOC]
ArchAttr
TT Arch
Syntax:
#tt.arch<
::mlir::tt::Arch # value
>
Enum cases:
- grayskull (Grayskull)
- wormhole_b0 (WormholeB0)
- blackhole (Blackhole)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::Arch | an enum of type Arch |
ArgumentAllocationAttr
Argument allocation attribute in TT dialect
Syntax:
#tt.arg_alloc<
uint64_t, # address
uint64_t, # size
MemorySpace # memorySpace
>
Holds the metadata for the allocation of an function argument i.e. for graph inputs.
Parameters:
Parameter | C++ type | Description |
---|---|---|
address | uint64_t | |
size | uint64_t | |
memorySpace | MemorySpace |
BufferAccessAttr
TT Buffer Access
Syntax:
#tt.buffer_access<
::mlir::tt::BufferAccess # value
>
Enum cases:
- alias (Alias)
- stream (Stream)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::BufferAccess | an enum of type BufferAccess |
BufferAttr
Buffer attribute in TT dialect
Syntax:
#tt.buffer<
MemRefType, # memref
BufferAccess # buffer_access
>
Describes the physical footprint and layout of a buffer in L1. Its memref must also have a shape with rank equal to DeviceAttr grid. It also carries a buffer access attribute which can be one of:
- Alias: This buffer aliases a persistent Tensor L1 allocation directly. Implies that no datamovement occurs and the compute kernel just accesses the local allocation directly.
- Stream: This buffer is a temporary destination as a means to get remote data for local computation. Remote data is most likely a tensor that is allocated in dram, but could also be data from a remote core.
Parameters:
Parameter | C++ type | Description |
---|---|---|
memref | MemRefType | A memref that describes the physical footprint and layout of the buffer. It must also have a shape with rank equal to DeviceAttr grid. |
buffer_access | BufferAccess | How data is accessed through this buffer, alias or stream. |
ChipChannelAttr
TT chip_channel attribute
Syntax:
#tt.chip_channel<
unsigned, # deviceId0
::llvm::ArrayRef<int64_t>, # ethernetCoreCoord0
unsigned, # deviceId1
::llvm::ArrayRef<int64_t> # ethernetCoreCoord1
>
TT chip_channel attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
deviceId0 | unsigned | |
ethernetCoreCoord0 | ::llvm::ArrayRef<int64_t> | |
deviceId1 | unsigned | |
ethernetCoreCoord1 | ::llvm::ArrayRef<int64_t> |
ChipCoordAttr
TT chip_coord attribute
Syntax:
#tt.chip_coord<
unsigned, # rack
unsigned, # shelf
unsigned, # y
unsigned # x
>
TT chip_coord attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
rack | unsigned | |
shelf | unsigned | |
y | unsigned | |
x | unsigned |
ChipDescAttr
TT chip_desc attribute
Syntax:
#tt.chip_desc<
ArchAttr, # arch
::llvm::ArrayRef<int64_t>, # grid
unsigned, # l1Size
unsigned, # numDramChannels
unsigned, # dramChannelSize
unsigned, # nocL1AddressAlignBytes
unsigned, # pcieAddressAlignBytes
unsigned, # nocDRAMAddressAlignBytes
unsigned, # l1UnreservedBase
unsigned, # eriscL1UnreservedBase
unsigned, # dramUnreservedBase
unsigned, # dramUnreservedEnd
ChipPhysicalCoresAttr, # chipPhysicalCores
::llvm::ArrayRef<DataTypeAttr>, # supportedDataTypes
::llvm::ArrayRef<TileSizeAttr> # supportedTileSizes
>
TT chip_desc attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
arch | ArchAttr | |
grid | ::llvm::ArrayRef<int64_t> | |
l1Size | unsigned | |
numDramChannels | unsigned | |
dramChannelSize | unsigned | |
nocL1AddressAlignBytes | unsigned | |
pcieAddressAlignBytes | unsigned | |
nocDRAMAddressAlignBytes | unsigned | |
l1UnreservedBase | unsigned | |
eriscL1UnreservedBase | unsigned | |
dramUnreservedBase | unsigned | |
dramUnreservedEnd | unsigned | |
chipPhysicalCores | ChipPhysicalCoresAttr | |
supportedDataTypes | ::llvm::ArrayRef<DataTypeAttr> | |
supportedTileSizes | ::llvm::ArrayRef<TileSizeAttr> |
ChipPhysicalCoresAttr
TT chip_physical_cores attribute
Syntax:
#tt.chip_physical_cores<
::llvm::ArrayRef<CoreCoordAttr>, # worker
::llvm::ArrayRef<CoreCoordAttr>, # dram
::llvm::ArrayRef<CoreCoordAttr>, # eth
::llvm::ArrayRef<CoreCoordAttr> # eth_inactive
>
TT chip_physical_cores attribute containing arrays of physical cores by core type in order of logical cores.
Parameters:
Parameter | C++ type | Description |
---|---|---|
worker | ::llvm::ArrayRef<CoreCoordAttr> | |
dram | ::llvm::ArrayRef<CoreCoordAttr> | |
eth | ::llvm::ArrayRef<CoreCoordAttr> | |
eth_inactive | ::llvm::ArrayRef<CoreCoordAttr> |
CoreCoordAttr
TT core_coord attribute
Syntax:
#tt.core_coord<
int64_t, # y
int64_t # x
>
TT core_coord attribute containing a single physical core coordinate.
Parameters:
Parameter | C++ type | Description |
---|---|---|
y | int64_t | |
x | int64_t |
DataTypeAttr
TT DataTypes
Syntax:
#tt.supportedDataTypes<
::mlir::tt::DataType # value
>
Enum cases:
- f32 (Float32)
- f16 (Float16)
- bf16 (BFloat16)
- bfp_f8 (BFP_Float8)
- bfp_bf8 (BFP_BFloat8)
- bfp_f4 (BFP_Float4)
- bfp_bf4 (BFP_BFloat4)
- bfp_f2 (BFP_Float2)
- bfp_bf2 (BFP_BFloat2)
- u32 (UInt32)
- u16 (UInt16)
- u8 (UInt8)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::DataType | an enum of type DataType |
DeviceAttr
Device attribute in TT dialect.
Syntax:
#tt.device<
::mlir::tt::GridAttr, # workerGrid
AffineMap, # l1Map
AffineMap, # dramMap
::llvm::ArrayRef<int64_t>, # meshShape
::llvm::ArrayRef<unsigned> # chipIds
>
Describes the physical layout of a device in the system and is made up of a few components:
- A grid attribute that describes the device's compute grid shape. It not only describes the shape of the compute grid, but also carries an affine map that describes how the logical grid maps to the physical grid.
- Two affine maps that describe how a tensor layout's linear attribute maps to the L1 and DRAM memory spaces.
- A mesh shape that describes the virtual layout of the chips with respect to each other. Note that in a multi-chip system, this grid encapsulates the entire system's grid shape, e.g. 8x16 grid could be made up of a 1x2 mesh of chips side-by-side. The mesh attribute configures how the above grid/map attributes are created such that they implement this mesh topology.
- An array of chip ids that this device is made up of. This array's length must match the volume of the mesh shape and should be interpreted in row-major order.
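A minimal validity check for the last constraint (illustrative Python, not part of the compiler):

```python
from math import prod

def validate_device(mesh_shape, chip_ids):
    # The chipIds array length must equal the meshShape volume;
    # the ids are interpreted in row-major order.
    return len(chip_ids) == prod(mesh_shape)

assert validate_device((2, 1, 2), [0, 1, 2, 3])  # 4 chips fill a 2x1x2 mesh
assert not validate_device((2, 2), [0, 1, 2])    # 3 chips can't fill 2x2
```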
Parameters:
Parameter | C++ type | Description |
---|---|---|
workerGrid | ::mlir::tt::GridAttr | TT grid attribute |
l1Map | AffineMap | |
dramMap | AffineMap | |
meshShape | ::llvm::ArrayRef<int64_t> | |
chipIds | ::llvm::ArrayRef<unsigned> |
GridAttr
TT grid attribute
Syntax:
#tt.grid<
::llvm::ArrayRef<int64_t>, # shape
AffineMap # mapping
>
TT grid attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
shape | ::llvm::ArrayRef<int64_t> | |
mapping | AffineMap |
IteratorTypeAttr
TT IteratorType
Syntax:
#tt.iterator_type<
::mlir::tt::IteratorType # value
>
Enum cases:
- parallel (Parallel)
- systolic (Systolic)
- broadcast (Broadcast)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::IteratorType | an enum of type IteratorType |
LayoutAttr
Tensor layout attribute
Syntax:
#tt.layout<
AffineMap, # linear
OOBVal, # oob_val
GridAttr, # grid
MemRefType, # memref
TensorMemoryLayout # mem_layout
>
The tensor layout attribute captures how tensor data is sharded across a grid of devices, cores, and is laid out in memory.
Some high level goals
- Logical shapes: Keep the original tensor shape and rank intact and agnostic to underlying storage layout. Keeping the logical shapes not only makes some graph transformations vastly simpler, in particular convs, but it makes the lowered IR much easier to read and reason about. The original tensor shapes leave breadcrumbs that make it much easier to map back to the input representation.
- Flexible sharding: Enable flexibility in choosing grid shape, to get better parallelization and avoid resharding. This is particularly important in cases where tensor shapes are not clean powers of two and would otherwise force our hand in choosing non-optimal grid shapes.
- Logical-Physical Isomorphism: Encode this information with just a few attributes to enable derived conversions from logical to physical layout and back.
- Explicit: A single source of truth.
- Enable a direct way to query padded regions.
Please refer to the Tensor Layout Spec for more in depth documentation.
Examples:
tensor<8x300xf32,
#tt.layout<(d0, d1) -> (d0, d1),
undef,
<1x2>,
memref<8x150xf32, #tt.memory_space<l1>>
>
>
tensor<8x96x32xf32,
#tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d2),
undef,
<2x1>,
memref<384x32xf32, #tt.memory_space<l1>>
>
>
tensor<8x96x32xf32,
#tt.layout<(d0, d1, d2) -> (d0 * 96 + d1, d1, d2),
undef,
<2x1x2>,
memref<384x96x16xf32, #tt.memory_space<l1>>
>
>
tensor<5x3x2x2x7x32x32xf32,
#tt.layout<
(d0, d1, d2, d3, d4, d5, d6)
-> (d0 * 2688 + d1 * 896 + d2 * 448 + d3 * 224 + d4 * 32 + d5, d4, d5, d6),
undef,
<3x2x2x2>,
memref<4480x4x16x16xf32, #tt.memory_space<l1>>
>
>
Parameters:
Parameter | C++ type | Description |
---|---|---|
linear | AffineMap | An affine map that defines how the logical tensor dimensions map to a grid shape. |
oob_val | OOBVal | A tracked out of bounds value that fills padding space. |
grid | GridAttr | The grid shape that this tensor is divided onto. |
memref | MemRefType | A memref that describes the physical footprint allocation of the shard. It must also have a shape with rank equal to grid. |
mem_layout | TensorMemoryLayout | The layout of the tensor in memory. |
MemorySpaceAttr
TT MemorySpace
Syntax:
#tt.memory_space<
::mlir::tt::MemorySpace # value
>
Enum cases:
- system (System)
- mmio (SystemMMIO)
- dram (DeviceDRAM)
- l1 (DeviceL1)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::MemorySpace | an enum of type MemorySpace |
OOBValAttr
TT OOBVal
Syntax:
#tt.oob_val<
::mlir::tt::OOBVal # value
>
Enum cases:
- undef (Undef)
- zero (Zero)
- one (One)
- inf (Inf)
- neginf (NegInf)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::OOBVal | an enum of type OOBVal |
OperandConstraintAttr
TT Operand Constraints
Syntax:
#tt.operand_constraint<
::mlir::tt::OperandConstraint # value
>
Enum cases:
- system (System)
- dram (DRAM)
- l1 (L1)
- scalar (Scalar)
- tile (Tile)
- none (None)
- interleaved (Interleaved)
- single_bank (SingleBank)
- height_sharded (HeightSharded)
- width_sharded (WidthSharded)
- block_sharded (BlockSharded)
- any_layout (AnyLayout)
- any (Any)
- any_device (AnyDevice)
- any_device_tile (AnyDeviceTile)
- l1_block_sharded (L1BlockSharded)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::OperandConstraint | an enum of type OperandConstraint |
SystemDescAttr
TT system_desc attribute
Syntax:
#tt.system_desc<
::llvm::ArrayRef<ChipDescAttr>, # chipDescs
::llvm::ArrayRef<unsigned>, # chipDescIndices
::llvm::ArrayRef<ChipCapabilityAttr>, # chipCapabilities
::llvm::ArrayRef<ChipCoordAttr>, # chipCoords
::llvm::ArrayRef<ChipChannelAttr> # chipChannels
>
TT system_desc attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
chipDescs | ::llvm::ArrayRef<ChipDescAttr> | |
chipDescIndices | ::llvm::ArrayRef<unsigned> | |
chipCapabilities | ::llvm::ArrayRef<ChipCapabilityAttr> | |
chipCoords | ::llvm::ArrayRef<ChipCoordAttr> | |
chipChannels | ::llvm::ArrayRef<ChipChannelAttr> |
TensorMemoryLayoutAttr
TT TensorMemoryLayout
Syntax:
#tt.tensor_memory_layout<
::mlir::tt::TensorMemoryLayout # value
>
Enum cases:
- none (None)
- interleaved (Interleaved)
- single_bank (SingleBank)
- height_sharded (HeightSharded)
- width_sharded (WidthSharded)
- block_sharded (BlockSharded)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::TensorMemoryLayout | an enum of type TensorMemoryLayout |
TileSizeAttr
TT tile_size attribute
Syntax:
#tt.tile_size<
int64_t, # y
int64_t # x
>
TT tile_size attribute containing a supported Tensix tile shape.
Parameters:
Parameter | C++ type | Description |
---|---|---|
y | int64_t | |
x | int64_t |
DeviceType
TT device
Syntax:
!tt.device<
::mlir::tt::DeviceAttr # desc
>
Device type in TT dialect
Parameters:
Parameter | C++ type | Description |
---|---|---|
desc | ::mlir::tt::DeviceAttr | Device attribute in TT dialect. |
TileType
TT tile
Syntax:
!tt.tile<
::llvm::ArrayRef<int64_t>, # shape
DataType # dataType
>
Tile type in TT dialect
Parameters:
Parameter | C++ type | Description |
---|---|---|
shape | ::llvm::ArrayRef<int64_t> | |
dataType | DataType |
'ttir' Dialect
This dialect provides high-level semantics for dispatching work to TT HW. It defines a set of declarative, high-level operations that are used to describe the dispatch, but is largely agnostic to the set of operations or dialects that are actually supported by a consuming backend.
[TOC]
ttir.abs
(tt::ttir::AbsOp)
Eltwise absolute op.
Eltwise absolute operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.add
(tt::ttir::AddOp)
Eltwise add.
Eltwise add operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
, TTIR_GenericRegionOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.alloc
(tt::ttir::AllocOp)
Alloc op.
Tensor Alloc operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace (enum cases: system (System), mmio (SystemMMIO), dram (DeviceDRAM), l1 (DeviceL1)) |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.broadcast
(tt::ttir::BroadcastOp)
Broadcast operation.
Broadcast op.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::ArrayAttr | 64-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.concat
(tt::ttir::ConcatOp)
Concat op.
Concat tensors along a given dimension.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.constant
(tt::ttir::ConstantOp)
Constant op.
Produces tensor filled with given constant value.
Examples:
%0 = "ttir.constant"() {value = dense<0> : tensor<2x3xi32>} : () -> tensor<2x3xi32>
// %0: [[0, 0, 0], [0, 0, 0]]
%1 = "ttir.constant"() {value = dense<[0.2, 1.3]> : tensor<2xf32>} : () -> tensor<2xf32>
// %1: [0.2, 1.3]
Traits: ConstantLike
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
value | ::mlir::ElementsAttr | constant vector/tensor attribute |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.conv2d
(tt::ttir::Conv2dOp)
Conv2d operation.
Applies a 2D convolution over an input image composed of several input planes.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
stride_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
groups | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_left | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_right | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_top | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_bottom | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
weight | ranked tensor of any type values |
bias | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.dealloc
(tt::ttir::DeallocOp)
Dealloc op.
Tensor Dealloc operation
Operands:
Operand | Description |
---|---|
result | ranked tensor of any type values |
ttir.div
(tt::ttir::DivOp)
Eltwise divide.
Eltwise divide operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.embedding
(tt::ttir::EmbeddingOp)
Embedding op.
Embedding operation.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
weight | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.exp
(tt::ttir::ExpOp)
Eltwise exponential op.
Eltwise exponential operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.generic
(tt::ttir::GenericOp)
Generically dispatch work to a grid of cores.
This generic op carries a region that represents the work each core does. The region is expected to have the same signature as the op itself. The op is expected to be lowered to a backend specific form by a consuming backend. This op is heavily inspired by the linalg.generic op so it can be useful to refer to linalg.generic documentation for more details.
%5 = "ttir.generic"(%1, %3, %4) <{
grid = #tt.grid<1x1>, // The grid range of cores to dispatch work to.
indexing_maps = [#map, #map, #map], // Affine maps for indexing into the input/output tensors. See linalg.generic
iterator_types = [#parallel, #parallel], // Iterator types for the input/output tensors. See linalg.generic
operandSegmentSizes = array<i32: 2, 1>, // Sizes of the operand segments, i.e. 2 inputs and 1 output.
({
^bb0(%arg2: memref<64x128xf32, #l1_>, %arg3: memref<64x128xf32, #l1_>, %arg4: memref<64x128xf32, #l1_>):
// Region body, would contain some computation that represents the work each core does.
}) : (tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
grid | ::mlir::tt::GridAttr | TT grid attribute |
indexing_maps | ::mlir::ArrayAttr | AffineMap array attribute |
iterator_types | ::mlir::ArrayAttr | |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.ge
(tt::ttir::GreaterEqualOp)
Eltwise greater than or equal to.
Eltwise greater than or equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.kernel
(tt::ttir::KernelOp)
Kernel call.
A generic kernel call operation. This operation is used to pattern match by some consuming backend.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
op | ::mlir::FlatSymbolRefAttr | flat symbol reference attribute |
kind | ::mlir::FlatSymbolRefAttr | flat symbol reference attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values or non-0-ranked.memref of any type values |
outputs | variadic of ranked tensor of any type values or non-0-ranked.memref of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values or non-0-ranked.memref of any type values |
ttir.matmul
(tt::ttir::MatmulOp)
Matrix multiply operation.
Matrix multiply operation.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
a | ranked tensor of any type values |
b | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.max
(tt::ttir::MaxOp)
Max reduction op.
Max reduction op.
Interfaces: DestinationStyleOpInterface
, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
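As a hypothetical sketch of the reduction attributes, a max over dimension 1 that keeps the reduced dimension might look like the following (attribute literals and shapes are illustrative, operand_constraints omitted):

```mlir
%out = tensor.empty() : tensor<64x1xf32>
%0 = "ttir.max"(%in, %out) {dim_arg = [1 : i32], keep_dim = true} : (tensor<64x128xf32>, tensor<64x1xf32>) -> tensor<64x1xf32>
```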
ttir.max_pool2d
(tt::ttir::MaxPool2dOp)
Applies a 2D max pooling over an input signal composed of several input planes.
Applies a 2D max pooling over an input signal composed of several input planes.
Interfaces: DestinationStyleOpInterface, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
kernel_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
kernel_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
ceil_mode | ::mlir::BoolAttr | bool attribute |
padding_left | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_right | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_top | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_bottom | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr | |
original_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
original_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.maximum
(tt::ttir::MaximumOp)
Eltwise maximum op.
Calculates the element-wise maximum of the input tensors' values and stores the result in the output tensor.
Example:
%lhs: [[3, 2, 7], [1, 4, 4]]
%rhs: [[1, 4, 2], [1, 2, 3]]
"ttir.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.mean
(tt::ttir::MeanOp)
Mean reduction op.
Mean reduction op.
Interfaces: DestinationStyleOpInterface, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.multiply
(tt::ttir::MultiplyOp)
Eltwise multiply.
Eltwise multiply operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface, TTIR_GenericRegionOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.neg
(tt::ttir::NegOp)
Eltwise negate op.
Eltwise negate operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.reciprocal
(tt::ttir::ReciprocalOp)
Eltwise reciprocal.
Eltwise reciprocal operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.relu
(tt::ttir::ReluOp)
Eltwise ReLU.
Eltwise ReLU operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.reshape
(tt::ttir::ReshapeOp)
Reshape op.
Reshape tensor.
Interfaces: DestinationStyleOpInterface, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
shape | ::mlir::ArrayAttr | 32-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.sigmoid
(tt::ttir::SigmoidOp)
Eltwise sigmoid.
Eltwise sigmoid operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.softmax
(tt::ttir::SoftmaxOp)
Softmax operation.
Softmax operation.
Interfaces: DestinationStyleOpInterface, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
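A hypothetical instance, applying softmax along dimension 1 of a batch of logits (shapes and attribute syntax are illustrative, operand_constraints omitted):

```mlir
%out = tensor.empty() : tensor<32x1000xf32>
%0 = "ttir.softmax"(%in, %out) {dimension = 1 : si32} : (tensor<32x1000xf32>, tensor<32x1000xf32>) -> tensor<32x1000xf32>
```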
ttir.sqrt
(tt::ttir::SqrtOp)
Eltwise square root.
Eltwise square root operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.squeeze
(tt::ttir::SqueezeOp)
Squeeze op.
Squeeze tensor.
Interfaces: DestinationStyleOpInterface, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.subtract
(tt::ttir::SubtractOp)
Eltwise subtract.
Eltwise subtract operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface, TTIROpInterface, TTIR_ElementwiseOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttir.sum
(tt::ttir::SumOp)
Sum reduction op.
Sum reduction op.
Interfaces: DestinationStyleOpInterface, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.to_layout
(tt::ttir::ToLayoutOp)
Layout op.
ToLayout operation, transition tensors from one layout to another. Some examples include:
- Transitioning between different memory spaces, e.g. DRAM to L1.
- Transitioning between different data types, e.g. f32 to f16.
- Transitioning between different tile sizes, e.g. 1x16 to 32x32.
- Transitioning between different tensor sharding layouts.
- Some combination of the above
```mlir
#layout = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #system>>
#layout1 = #tt.layout<8192x128x1, undef, <1x1>, memref<64x128xf32, #l1_>>
%1 = "ttir.to_layout"(%arg0, %0) : (tensor<64x128xf32, #layout>, tensor<64x128xf32, #layout1>) -> tensor<64x128xf32, #layout1>
```
Interfaces: DestinationStyleOpInterface, TTIROpInterface
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.transpose
(tt::ttir::TransposeOp)
Transpose op.
Transpose tensor along two given dimensions.
Interfaces: DestinationStyleOpInterface, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim0 | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dim1 | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
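A hypothetical instance swapping the two dimensions of a 2D tensor (shapes and attribute syntax are illustrative, operand_constraints omitted):

```mlir
%out = tensor.empty() : tensor<128x64xf32>
%0 = "ttir.transpose"(%in, %out) {dim0 = 0 : si32, dim1 = 1 : si32} : (tensor<64x128xf32>, tensor<128x64xf32>) -> tensor<128x64xf32>
```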
ttir.unsqueeze
(tt::ttir::UnsqueezeOp)
Unsqueeze op.
Unsqueeze tensor.
Interfaces: DestinationStyleOpInterface, TTIROpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
operand_constraints | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttir.yield
(tt::ttir::YieldOp)
Yield op.
Yield operation; this is required by MLIR to mark the end of a dispatch region.
Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
Operands:
Operand | Description |
---|---|
values | variadic of ranked tensor of any type values or non-0-ranked.memref of any type values |
'ttkernel' Dialect
A TTKernel out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.
EthernetConfigAttr
TT EthernetConfig attribute
Syntax:
#ttkernel.ethernet_config<
EthType, # eth_type
NocIndex # noc_index
>
TT ethernet_config attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
eth_type | EthType | |
noc_index | NocIndex |
NocConfigAttr
TT NocConfig attribute
Syntax:
#ttkernel.noc_config<
NocIndex # noc_index
>
TT noc_config attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
noc_index | NocIndex |
TensixConfigAttr
TT TensixConfig attribute
Syntax:
#ttkernel.tensix_config<
MathFidelity, # math_fidelity
bool, # fp32_dest_acc_en
bool, # preserve_fp32_precision
bool # math_approx_mode
>
TT compute_desc attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
math_fidelity | MathFidelity | |
fp32_dest_acc_en | bool | |
preserve_fp32_precision | bool | |
math_approx_mode | bool |
ThreadTypeAttr
TTKernel ThreadTypes
Syntax:
#ttkernel.thread<
::mlir::tt::ttkernel::ThreadType # value
>
Enum cases:
- noc (Noc)
- tensix (Tensix)
- ethernet (Ethernet)
Parameters:
Parameter | C++ type | Description |
---|---|---|
value | ::mlir::tt::ttkernel::ThreadType | an enum of type ThreadType |
ttkernel.acquire_dst
(tt::ttkernel::AcquireDstOp)
Acquire dest call.
Acquire dest operation
ttkernel.add
(tt::ttkernel::AddOp)
Add operation
Add operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
ttkernel.add_tiles_init
(tt::ttkernel::AddTilesInitOp)
Short init function
Must be run before add_tiles.
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
ttkernel.add_tiles
(tt::ttkernel::AddTilesOp)
Add operation
Performs element-wise addition C=A+B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in an acquired state via an acquire_dst call. This call is blocking and is only available on the compute engine.
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
in0_tile_index | 32-bit signless integer |
in1_tile_index | 32-bit signless integer |
dst_index | 32-bit signless integer |
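Putting the pieces above together, the implied call ordering for a single-tile eltwise add can be sketched as pseudocode (names follow the ttkernel ops on this page; CB handles and tile counts are illustrative, not exact IR syntax):

```
// Compute-thread sketch: C = A + B for one tile.
binary_op_init_common(in0_cb, in1_cb, out_cb)
add_tiles_init(in0_cb, in1_cb)
cb_wait_front(in0_cb, 1)            // block until a tile arrives in each input CB
cb_wait_front(in1_cb, 1)
acquire_dst()                       // take the DST register buffer lock
add_tiles(in0_cb, in1_cb, 0, 0, 0)  // DST[0] = in0[0] + in1[0]
cb_reserve_back(out_cb, 1)
pack_tile(0, out_cb, 0)             // copy DST[0] into the reserved output slot
cb_push_back(out_cb, 1)             // publish the tile to the consumer
release_dst()
cb_pop_front(in0_cb, 1)             // free the consumed input tiles
cb_pop_front(in1_cb, 1)
```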
ttkernel.binary_op_init_common
(tt::ttkernel::BinaryOpInitCommonOp)
Init function for all binary ops
Followed by the specific init required with an opcode (binary_op_specific_init).
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
out_cb | TTKernel cb |
ttkernel.builtin
(tt::ttkernel::BuiltinOp)
Builtin call.
Kernel operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
op | ::mlir::FlatSymbolRefAttr | flat symbol reference attribute |
kind | ::mlir::FlatSymbolRefAttr | flat symbol reference attribute |
Operands:
Operand | Description |
---|---|
args | variadic of non-0-ranked.memref of any type values or TTKernel cb |
ttkernel.cb_pop_front
(tt::ttkernel::CBPopFrontOp)
CBPopFront call.
CBPopFront operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
numPages | 32-bit signless integer |
ttkernel.cb_push_back
(tt::ttkernel::CBPushBackOp)
CBPushBack call.
CBPushBack operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
numPages | 32-bit signless integer |
ttkernel.cb_reserve_back
(tt::ttkernel::CBReserveBackOp)
CBReserveBack call.
CBReserveBack operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
numPages | 32-bit signless integer |
ttkernel.cb_wait_front
(tt::ttkernel::CBWaitFrontOp)
CBWaitFront call.
CBWaitFront operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
numPages | 32-bit signless integer |
ttkernel.get_noc_addr
(tt::ttkernel::GetNocAddrOp)
GetNocAddr
GetNocAddr
Interfaces: InferTypeOpInterface
Operands:
Operand | Description |
---|---|
x | 32-bit signless integer |
y | 32-bit signless integer |
l1Address | 32-bit signless integer |
Results:
Result | Description |
---|---|
nocAddr | TTKernel noc address |
ttkernel.matmul
(tt::ttkernel::MatmulOp)
Matmul operation
Matmul operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
ttkernel.mul
(tt::ttkernel::MulOp)
Mul operation
Mul operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
ttkernel.mul_tiles_init
(tt::ttkernel::MulTilesInitOp)
Short init function
Must be run before mul_tiles.
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
ttkernel.mul_tiles
(tt::ttkernel::MulTilesOp)
Mul operation
Performs element-wise multiplication C=A*B of tiles in two CBs at given indices and writes the result to the DST register at index dst_tile_index. The DST register buffer must be in an acquired state via an acquire_dst call. This call is blocking and is only available on the compute engine.
Operands:
Operand | Description |
---|---|
in0_cb | TTKernel cb |
in1_cb | TTKernel cb |
in0_tile_index | 32-bit signless integer |
in1_tile_index | 32-bit signless integer |
dst_index | 32-bit signless integer |
ttkernel.noc_async_read_barrier
(tt::ttkernel::NocAsyncReadBarrierOp)
NocAsyncReadBarrier
NocAsyncReadBarrier
ttkernel.noc_async_read
(tt::ttkernel::NocAsyncReadOp)
NocAsyncRead
NocAsyncRead
Operands:
Operand | Description |
---|---|
srcNocAddr | TTKernel noc address |
dstLocalL1Addr | 32-bit signless integer |
size | 32-bit signless integer |
ttkernel.noc_async_write_barrier
(tt::ttkernel::NocAsyncWriteBarrierOp)
NocAsyncWriteBarrier
NocAsyncWriteBarrier
ttkernel.noc_async_write
(tt::ttkernel::NocAsyncWriteOp)
NocAsyncWrite
NocAsyncWrite
Operands:
Operand | Description |
---|---|
srcLocalL1Addr | 32-bit signless integer |
dstNocAddr | TTKernel noc address |
size | 32-bit signless integer |
ttkernel.pack
(tt::ttkernel::PackOp)
Pack op.
Pack operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
out_cb | TTKernel cb |
out_index | 32-bit signless integer |
ttkernel.pack_set_data_type
(tt::ttkernel::PackSetDataTypeOp)
Pack set DataType op.
Pack set DataType operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
data_type | ::mlir::IntegerAttr | TT DataTypes{{% markdown %}}Enum cases: * f32 (`Float32`) * f16 (`Float16`) * bf16 (`BFloat16`) * bfp_f8 (`BFP_Float8`) * bfp_bf8 (`BFP_BFloat8`) * bfp_f4 (`BFP_Float4`) * bfp_bf4 (`BFP_BFloat4`) * bfp_f2 (`BFP_Float2`) * bfp_bf2 (`BFP_BFloat2`) * u32 (`UInt32`) * u16 (`UInt16`) * u8 (`UInt8`){{% /markdown %}} |
ttkernel.pack_tile
(tt::ttkernel::PackTileOp)
PackTile op.
Copies a single tile from the DST register buffer at a specified index to a specified CB at a given index. For out_tile_index to be valid for this call, cb_reserve_back(n) must be called first to reserve at least some number n > 0 of tiles in the output CB. out_tile_index = 0 then references the first tile in the reserved section of the CB, up to index n - 1, which will be visible to the consumer in the same order after a cb_push_back call. The DST register buffer must be in an acquired state via an acquire_dst call. This call is blocking and is only available on the compute engine.
Each subsequent pack call increments the write pointer in the CB by a single tile size. The pointer is set back to a valid position with space for n reserved tiles by another cb_reserve_back call.
Operates in tandem with cb_reserve_back and cb_push_back.
A typical use case: the producer first ensures that a number of tiles is available in the buffer via cb_reserve_back, then uses pack_tile to copy a tile from one of the DST slots to a slot in the reserved space, and finally calls cb_push_back to announce the visibility of the reserved section of the circular buffer to the consumer.
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
out_cb | TTKernel cb |
out_index | 32-bit signless integer |
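The producer protocol described above can be condensed into three calls (pseudocode; out_cb, n, and the indices are placeholders):

```
cb_reserve_back(out_cb, n)       // block until n tile slots are free
pack_tile(dst_index, out_cb, i)  // for each i in [0, n): copy DST[dst_index] to slot i
cb_push_back(out_cb, n)          // make the n packed tiles visible to the consumer
```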
ttkernel.release_dst
(tt::ttkernel::ReleaseDstOp)
Release dest call.
Release dest operation
ttkernel.return
(tt::ttkernel::ReturnOp)
Return op.
Return operation
Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
ttkernel.sub
(tt::ttkernel::SubOp)
Sub operation
Sub operation
Operands:
Operand | Description |
---|---|
dst_index | 32-bit signless integer |
ttkernel.tile_regs_acquire
(tt::ttkernel::TileRegsAcquireOp)
Tile_regs_acquire
Acquire an exclusive lock on the DST register for the MATH thread. This register is an array of 16 tiles of 32x32 elements each. This is a blocking function, i.e. this function will wait until the lock is acquired.
ttkernel.tile_regs_commit
(tt::ttkernel::TileRegsCommitOp)
Tile_regs_commit
Release lock on DST register by MATH thread. The lock had to be previously acquired with tile_regs_acquire.
ttkernel.tile_regs_release
(tt::ttkernel::TileRegsReleaseOp)
Tile_regs_release
Release lock on DST register by PACK thread. The lock had to be previously acquired with tile_regs_wait.
ttkernel.tile_regs_wait
(tt::ttkernel::TileRegsWaitOp)
Tile_regs_wait
Acquire an exclusive lock on the DST register for the PACK thread. It waits for the MATH thread to commit the DST register. This is a blocking function, i.e. this function will wait until the lock is acquired.
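Taken together, the four tile_regs ops implement a hand-off of the DST register between the MATH and PACK threads; an illustrative sketch of the protocol:

```
// MATH thread:
tile_regs_acquire()   // blocking: take the DST lock for MATH
// ... math ops (e.g. add_tiles) write tiles into DST ...
tile_regs_commit()    // release the MATH lock, signalling PACK

// PACK thread:
tile_regs_wait()      // blocking: wait for MATH to commit DST
// ... pack_tile calls copy tiles out of DST ...
tile_regs_release()   // free DST for the next tile_regs_acquire
```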
ttkernel.tilize_block
(tt::ttkernel::TilizeBlockOp)
TilizeBlockOp call.
TilizeBlockOp operation
Operands:
Operand | Description |
---|---|
cbIn | TTKernel cb |
numTiles | 32-bit signless integer |
cbOut | TTKernel cb |
ttkernel.tilize_init
(tt::ttkernel::TilizeInitOp)
TilizeInitOp call.
TilizeInitOp operation
Operands:
Operand | Description |
---|---|
cbIn | TTKernel cb |
numTiles | 32-bit signless integer |
cbOut | TTKernel cb |
ttkernel.unpack_ab
(tt::ttkernel::UnpackABOp)
UnpackAB op.
UnpackAB operation
Operands:
Operand | Description |
---|---|
cb_a | TTKernel cb |
src_a_index | 32-bit signless integer |
cb_b | TTKernel cb |
src_b_index | 32-bit signless integer |
ttkernel.unpack_a
(tt::ttkernel::UnpackAOp)
UnpackA op.
UnpackA operation
Operands:
Operand | Description |
---|---|
cb | TTKernel cb |
src_index | 32-bit signless integer |
ttkernel.unpack_set_data_type
(tt::ttkernel::UnpackSetDataTypeOp)
Unpack set DataType op.
Unpack set DataType operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
data_type_a | ::mlir::IntegerAttr | TT DataTypes{{% markdown %}}Enum cases: * f32 (`Float32`) * f16 (`Float16`) * bf16 (`BFloat16`) * bfp_f8 (`BFP_Float8`) * bfp_bf8 (`BFP_BFloat8`) * bfp_f4 (`BFP_Float4`) * bfp_bf4 (`BFP_BFloat4`) * bfp_f2 (`BFP_Float2`) * bfp_bf2 (`BFP_BFloat2`) * u32 (`UInt32`) * u16 (`UInt16`) * u8 (`UInt8`){{% /markdown %}} |
data_type_b | ::mlir::IntegerAttr | TT DataTypes{{% markdown %}}Enum cases: * f32 (`Float32`) * f16 (`Float16`) * bf16 (`BFloat16`) * bfp_f8 (`BFP_Float8`) * bfp_bf8 (`BFP_BFloat8`) * bfp_f4 (`BFP_Float4`) * bfp_bf4 (`BFP_BFloat4`) * bfp_f2 (`BFP_Float2`) * bfp_bf2 (`BFP_BFloat2`) * u32 (`UInt32`) * u16 (`UInt16`) * u8 (`UInt8`){{% /markdown %}} |
ttkernel.unreachable
(tt::ttkernel::UnreachableOp)
Unreachable op.
Unreachable operation
Traits: AlwaysSpeculatableImplTrait, ReturnLike, Terminator
Interfaces: ConditionallySpeculatable, NoMemoryEffect (MemoryEffectOpInterface), RegionBranchTerminatorOpInterface
Effects: MemoryEffects::Effect{}
ttkernel.untilize_block
(tt::ttkernel::UntilizeBlockOp)
UntilizeBlockOp call.
UntilizeBlockOp operation
Operands:
Operand | Description |
---|---|
cbIn | TTKernel cb |
numTiles | 32-bit signless integer |
cbOut | TTKernel cb |
ttkernel.untilize_init
(tt::ttkernel::UntilizeInitOp)
UntilizeInitOp call.
UntilizeInitOp operation
Operands:
Operand | Description |
---|---|
cbIn | TTKernel cb |
cbOut | TTKernel cb |
CBType
TTKernel cb
Syntax:
!ttkernel.cb<
CBPort, # port
uint64_t, # address
MemRefType, # memref
uint64_t, # page_size
uint64_t # num_buffers
>
Circular buffer type in TTKernel dialect
Parameters:
Parameter | C++ type | Description |
---|---|---|
port | CBPort | |
address | uint64_t | |
memref | MemRefType | |
page_size | uint64_t | |
num_buffers | uint64_t |
NocAddrType
TTKernel noc address
Syntax: !ttkernel.noc_addr
Noc address type in TTKernel dialect
'ttmetal' Dialect
A TTMetal out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.
CoreRangeAttr
TTMetal grid attribute
Syntax:
#ttmetal.core_range<
::llvm::ArrayRef<int64_t>, # offset
::llvm::ArrayRef<int64_t> # size
>
TTMetal grid attribute
Parameters:
Parameter | C++ type | Description |
---|---|---|
offset | ::llvm::ArrayRef<int64_t> | |
size | ::llvm::ArrayRef<int64_t> |
ttmetal.alloc
(tt::ttmetal::AllocOp)
Alloc op.
Tensor Alloc operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace{{% markdown %}}Enum cases: * system (`System`) * mmio (`SystemMMIO`) * dram (`DeviceDRAM`) * l1 (`DeviceL1`){{% /markdown %}} |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttmetal.dealloc
(tt::ttmetal::DeallocOp)
Dealloc op.
Tensor Dealloc operation
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
ttmetal.dispatch
(tt::ttmetal::DispatchOp)
Dispatch op.
Dispatch operation
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
core_ranges | ::mlir::ArrayAttr | |
kernelConfigs | ::mlir::ArrayAttr |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttmetal.host_read
(tt::ttmetal::HostReadOp)
Host read op.
Host read operation
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttmetal.host_write
(tt::ttmetal::HostWriteOp)
Host write op.
Host write operation
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
'ttnn' Dialect
A TTNN out-of-tree MLIR dialect. This dialect is an example of an out-of-tree MLIR dialect designed to illustrate the basic setup required to develop MLIR-based tools without working inside of the LLVM source tree.
ttnn.abs
(tt::ttnn::AbsOp)
Eltwise absolute.
Eltwise absolute operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.add
(tt::ttnn::AddOp)
Eltwise add.
Eltwise add operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.alloc
(tt::ttnn::AllocOp)
Alloc op.
Tensor Alloc operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
address | ::mlir::IntegerAttr | 64-bit signless integer attribute |
size | ::mlir::IntegerAttr | 64-bit signless integer attribute |
memory_space | ::mlir::tt::MemorySpaceAttr | TT MemorySpace{{% markdown %}}Enum cases: * system (`System`) * mmio (`SystemMMIO`) * dram (`DeviceDRAM`) * l1 (`DeviceL1`){{% /markdown %}} |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.concat
(tt::ttnn::ConcatOp)
Concat op.
Concat tensors along a given dimension.
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.conv2d
(tt::ttnn::Conv2dOp)
Conv2d operation.
Applies a 2D convolution over an input image composed of several input planes.
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
in_channels | ::mlir::IntegerAttr | 32-bit signless integer attribute |
out_channels | ::mlir::IntegerAttr | 32-bit signless integer attribute |
batch_size | ::mlir::IntegerAttr | 32-bit signless integer attribute |
input_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
input_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
kernel_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
kernel_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
stride_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
stride_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
padding_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
padding_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
dilation_height | ::mlir::IntegerAttr | 32-bit signless integer attribute |
dilation_width | ::mlir::IntegerAttr | 32-bit signless integer attribute |
groups | ::mlir::IntegerAttr | 32-bit signless integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
weight | ranked tensor of any type values |
bias | ranked tensor of any type values |
output | ranked tensor of any type values |
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.dealloc
(tt::ttnn::DeallocOp)
Dealloc op.
Tensor Dealloc operation
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
ttnn.div
(tt::ttnn::DivOp)
Eltwise divide.
Eltwise divide operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.embedding
(tt::ttnn::EmbeddingOp)
Embedding op.
Embedding operation.
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
weight | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.empty
(tt::ttnn::EmptyOp)
Empty op.
Tensor empty operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
shape | ::mlir::tt::ttnn::ShapeAttr | TTNN Shape attribute{{% markdown %}} TTNN shape attribute {{% /markdown %}} |
dtype | ::mlir::tt::DataTypeAttr | TT DataTypes{{% markdown %}}Enum cases: * f32 (`Float32`) * f16 (`Float16`) * bf16 (`BFloat16`) * bfp_f8 (`BFP_Float8`) * bfp_bf8 (`BFP_BFloat8`) * bfp_f4 (`BFP_Float4`) * bfp_bf4 (`BFP_BFloat4`) * bfp_f2 (`BFP_Float2`) * bfp_bf2 (`BFP_BFloat2`) * u32 (`UInt32`) * u16 (`UInt16`) * u8 (`UInt8`){{% /markdown %}} |
layout | ::mlir::tt::ttnn::LayoutAttr | TTNN Layout{{% markdown %}}Enum cases: * row_major (`RowMajor`) * tile (`Tile`) * invalid (`Invalid`){{% /markdown %}} |
memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN MemoryConfig attribute{{% markdown %}} TTNN memory config attribute {{% /markdown %}} |
Operands:
Operand | Description |
---|---|
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
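Putting the attributes together, a `ttnn.empty` invocation might look like the following in generic IR form. The attribute literal syntax shown here is an approximation; consult the dialect's attribute definitions for the exact printed form:

```mlir
// Hypothetical example: allocate an empty 32x32 bf16 tile-layout tensor on device.
// Attribute spellings (#ttnn.shape, #tt.supportedDataTypes, #ttnn.layout) are approximate.
%device = "ttnn.get_device"() : () -> !tt.device<#device>
%empty = "ttnn.empty"(%device) <{shape = #ttnn.shape<[32, 32]>,
                                 dtype = #tt.supportedDataTypes<bf16>,
                                 layout = #ttnn.layout<tile>}>
    : (!tt.device<#device>) -> tensor<32x32xbf16>
```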
ttnn.exp
(tt::ttnn::ExpOp)
Eltwise exponential.
Eltwise exponential operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.full
(tt::ttnn::FullOp)
Full op.
Tensor full operation
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
fillValue | ::mlir::FloatAttr | 32-bit float attribute |
Operands:
Operand | Description |
---|---|
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.get_device
(tt::ttnn::GetDeviceOp)
Get Device op.
This op returns the current runtime device.
Results:
Result | Description |
---|---|
device | TT device |
ttnn.ge
(tt::ttnn::GreaterEqualOp)
Eltwise greater than or equal to.
Eltwise greater than or equal to operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.matmul
(tt::ttnn::MatmulOp)
Matmul op.
Matmul operation.
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
a | ranked tensor of any type values |
b | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
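Because `ttnn.matmul` is destination-style, the output tensor is allocated first (e.g. via `ttnn.empty`) and passed as the third operand; the `result` value then aliases that buffer. The shapes below are illustrative:

```mlir
// Hypothetical example: (64x128) x (128x32) -> 64x32, writing into %out.
%product = "ttnn.matmul"(%a, %b, %out)
    : (tensor<64x128xbf16>, tensor<128x32xbf16>, tensor<64x32xbf16>) -> tensor<64x32xbf16>
```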
ttnn.max
(tt::ttnn::MaxOp)
Max reduction op.
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.max_pool2d
(tt::ttnn::MaxPool2dOp)
Applies a 2D max pooling over an input signal composed of several input planes.
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
batch_size | ::mlir::IntegerAttr | 32-bit signed integer attribute |
input_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
input_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
channels | ::mlir::IntegerAttr | 32-bit signed integer attribute |
kernel_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
kernel_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
stride_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dilation_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
ceil_mode | ::mlir::BoolAttr | bool attribute |
padding_height | ::mlir::IntegerAttr | 32-bit signed integer attribute |
padding_width | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
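The pooling geometry is carried entirely in attributes; the op does not infer it from the tensor types. A sketch with a 2x2 kernel and stride 2 (all shapes and attribute values are illustrative, not taken from a verified lowering):

```mlir
// Hypothetical example: 2x2 max pool, stride 2, no padding, no dilation,
// on a 1x32x32x64 bf16 input, producing a 1x16x16x64 output.
%pooled = "ttnn.max_pool2d"(%input, %out, %device)
    <{batch_size = 1 : si32, input_height = 32 : si32, input_width = 32 : si32,
      channels = 64 : si32, kernel_height = 2 : si32, kernel_width = 2 : si32,
      stride_height = 2 : si32, stride_width = 2 : si32,
      dilation_height = 1 : si32, dilation_width = 1 : si32,
      ceil_mode = false, padding_height = 0 : si32, padding_width = 0 : si32}>
    : (tensor<1x32x32x64xbf16>, tensor<1x16x16x64xbf16>, !tt.device<#device>)
    -> tensor<1x16x16x64xbf16>
```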
ttnn.maximum
(tt::ttnn::MaximumOp)
Eltwise maximum OP.
Calculates maximum of input tensors' values element-wise and stores result in output tensor.
Example:
%lhs: [[3, 2, 7], [1, 4, 4]]
%rhs: [[1, 4, 2], [1, 2, 3]]
"ttnn.maximum"(%lhs, %rhs, %out) -> %out: [[3, 4, 7], [1, 4, 4]]
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.mean
(tt::ttnn::MeanOp)
Mean reduction op.
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.multiply
(tt::ttnn::MultiplyOp)
Eltwise multiply.
Eltwise multiply operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.neg
(tt::ttnn::NegOp)
Eltwise negate.
Eltwise negate operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.reciprocal
(tt::ttnn::ReciprocalOp)
Eltwise reciprocal.
Eltwise reciprocal operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.relu
(tt::ttnn::ReluOp)
Eltwise ReLU.
Eltwise ReLU operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.reshape
(tt::ttnn::ReshapeOp)
Reshape op.
Reshape tensor.
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
shape | ::mlir::ArrayAttr | 32-bit integer array attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.sigmoid
(tt::ttnn::SigmoidOp)
Eltwise sigmoid.
Eltwise sigmoid operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.softmax
(tt::ttnn::SoftmaxOp)
Softmax op.
Softmax operation.
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dimension | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.sqrt
(tt::ttnn::SqrtOp)
Eltwise sqrt.
Eltwise sqrt operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.subtract
(tt::ttnn::SubtractOp)
Eltwise subtract.
Eltwise subtract operation.
Traits: AttrSizedOperandSegments
Interfaces: DestinationStyleOpInterface
Operands:
Operand | Description |
---|---|
inputs | variadic of ranked tensor of any type values |
outputs | variadic of ranked tensor of any type values |
Results:
Result | Description |
---|---|
results | variadic of ranked tensor of any type values |
ttnn.sum
(tt::ttnn::SumOp)
Sum reduction op.
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
keep_dim | ::mlir::BoolAttr | bool attribute |
dim_arg | ::mlir::ArrayAttr | 32-bit integer array attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
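The reduction ops (`ttnn.sum`, `ttnn.mean`, `ttnn.max`) share the `dim_arg`/`keep_dim` attribute pair: `dim_arg` lists the dimensions to reduce over, and `keep_dim` controls whether those dimensions are retained with size 1 in the output. A sketch (types illustrative):

```mlir
// Hypothetical example: sum over dimension 1, keeping it as a size-1 dimension.
%summed = "ttnn.sum"(%input, %out) <{dim_arg = [1 : i32], keep_dim = true}>
    : (tensor<4x8xbf16>, tensor<4x1xbf16>) -> tensor<4x1xbf16>
```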
ttnn.to_device
(tt::ttnn::ToDeviceOp)
ToDevice op.
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
memory_config | ::mlir::tt::ttnn::MemoryConfigAttr | TTNN memory config attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.to_layout
(tt::ttnn::ToLayoutOp)
ToLayout op.
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
layout | ::mlir::tt::ttnn::LayoutAttr | TTNN layout. Enum cases: row_major (`RowMajor`), tile (`Tile`), invalid (`Invalid`) |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
ttnn.to_memory_config
(tt::ttnn::ToMemoryConfigOp)
ToMemoryConfig op.
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
device | TT device |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
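The `ttnn.to_device` / `ttnn.to_layout` / `ttnn.to_memory_config` ops typically appear as a data-movement prologue: move a host tensor onto the device, convert it to tile layout, then place it in the desired memory configuration. A rough sketch (attribute syntax approximate; the memory config literal is a placeholder):

```mlir
// Hypothetical example: host tensor -> device, then row-major -> tile layout.
%device = "ttnn.get_device"() : () -> !tt.device<#device>
%on_dev = "ttnn.to_device"(%host, %device) <{memory_config = #ttnn.memory_config<...>}>
    : (tensor<32x32xbf16>, !tt.device<#device>) -> tensor<32x32xbf16>
%tiled = "ttnn.to_layout"(%on_dev, %device) <{layout = #ttnn.layout<tile>}>
    : (tensor<32x32xbf16>, !tt.device<#device>) -> tensor<32x32xbf16>
```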
ttnn.transpose
(tt::ttnn::TransposeOp)
Transpose op.
Transpose tensor along two given dimensions.
Interfaces: DestinationStyleOpInterface
Attributes:
Attribute | MLIR Type | Description |
---|---|---|
dim0 | ::mlir::IntegerAttr | 32-bit signed integer attribute |
dim1 | ::mlir::IntegerAttr | 32-bit signed integer attribute |
Operands:
Operand | Description |
---|---|
input | ranked tensor of any type values |
output | ranked tensor of any type values |
Results:
Result | Description |
---|---|
result | ranked tensor of any type values |
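`dim0` and `dim1` name the two dimensions to swap, and the result type reflects the swapped shape. A sketch (types illustrative):

```mlir
// Hypothetical example: swap dimensions 0 and 1 of a 32x64 tensor.
%transposed = "ttnn.transpose"(%input, %out) <{dim0 = 0 : si32, dim1 = 1 : si32}>
    : (tensor<32x64xbf16>, tensor<64x32xbf16>) -> tensor<64x32xbf16>
```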