Getting Started

TT-Metalium is a framework for accelerating both ML and non-ML workloads on Tenstorrent hardware. It offers an abstraction between the host system (e.g., desktop with x86 CPU) and Tenstorrent hardware.

TT-Metalium provides a C++ API for writing kernels that run on Tensix hardware; for configuring these kernels; and for executing kernels on the hardware. It gives the programmer full control over the hardware and data movement. This allows code to be optimized for performance and efficiency. Since hardware details are exposed, the programmer can write code that is tailored to their specific use case.

The GitHub repository for the project is located here: https://github.com/tenstorrent/tt-metal

Software Stack Overview

TT-Metalium sits at the foundation of Tenstorrent’s software stack:

Programming Philosophy

Operations on Tenstorrent hardware are typically designed from the bottom up:

  1. Start with a kernel on a single Tensix core

  2. Schedule the kernel across multiple Tensix cores (may require synchronization)

  3. Scale to multiple devices (critical for large model deployment)

The examples in this guide follow this progression, starting simple and building complexity.

Installation

Install dependencies and build the project by following the instructions in the installation guide.

Key Concepts

Pipeline Dataflow Pattern

Kernels on Tenstorrent hardware typically follow a three-stage pipeline:

  1. Reader Kernel (Data Movement): Reads data from DRAM/SRAM into circular buffers

  2. Compute Kernel: Processes data using Matrix/Vector engines

  3. Writer Kernel (Data Movement): Writes results back to DRAM/SRAM

Circular buffers enable communication and synchronization between kernels, acting as FIFO data structures. They allow the reader to fetch new data while compute processes previous data, enabling overlapped execution.

The DRAM Loopback example (Step 1 below) demonstrates the basic data movement pattern. Step 2 introduces the full pipeline with compute kernels and circular buffers.

Quick Start Guide

Basic Usage

Step 1: DRAM Loopback

Learn the basic structure of an Metalium application by implementing a DRAM Loopback that copies data from one DRAM buffer to another (hence loopback). This example will help you understand the basic concepts fundamental to writing Metalium applications.

What you’ll learn: Basic host and kernel structure, buffer management, and data transfer.

Step 2: Eltwise Binary Kernel

Build on the loopback example by implementing an Eltwise Binary Kernel that performs element-wise addition of two buffers. This will introduce you to performing computations using the matrix engine (FPU) and passing data between kernels within a single Tensix core.

What you’ll learn: Circular buffer for data passing, compute kernels, using the matrix engine for computations.

Step 3: Eltwise SFPU

Extend the previous example to implement an Eltwise SFPU kernel that performs element-wise addition using the SFPU (vector engine, Special Function Processing Unit). This will introduce you to the SFPU and how to use it for vectorized operations.

What you’ll learn: Performing operations using the SFPU.

Intermediate Usage

Step 4: Single-core Matrix Multiplication

Implement a Single-core Matrix Multiplication Kernel that performs matrix multiplication using the matrix engine. This will help you understand how to handle complex dataflow and computations on the Tensix core.

What you’ll learn: Complex dataflow, tilized operatins and using the matrix engine for matrix multiplication.

Advanced Usage

Step 5: Multi-core Matrix Multiplication

Build on the single-core matrix multiplication example to implement a Multi-core Matrix Multiplication Kernel that distributes the workload across multiple Tensix cores. This will introduce you to parallel processing and how to optimize performance by leveraging multiple cores.

What you’ll learn: Parallel processing and splitting workloads across multiple cores.

Step 6: Optimized Multi-core Matrix Multiplication

Optimize the multi-core matrix multiplication kernel by implementing exploting the grid structure of the processor. Avoid redundant reads from DRAM and avoid NoC congestion by reading with one core and broadcasting the data to other cores. This will help you understand how to optimize performance by minimizing unneeded data movement and maximizing data reuse.

What you’ll learn: Performance optimization techniques for high-performance kernels.

Resources and Debugging

Documentation

Debugging Tools

For information on debugging and profiling tools (DPRINT, Tracy Profiler, Device Profiler, Inspector, Watcher, etc.), see the Tools documentation.

Community Support

Next Steps

For ML Developers

Use the higher-level TT-NN API for model development and deployment. TT-NN builds on TT-Metalium but provides PyTorch-like operations and automatic kernel selection.

For Architecture Deep Dives

Read the comprehensive Metalium Architecture Guide for deep dives into hardware architecture, NoC topology, and advanced programming patterns.

For Contributors

Review the contribution guidelines before submitting changes. We are always happy to assist in merging your contributions.

For Custom Kernels

Study the provided examples and adapt them to your specific use case. Start simple and optimize iteratively.