
Exploring the TT-Metalium Playground

Welcome to the heart of Tenstorrent development! In this lesson you'll discover what's possible with TT-Metalium and TTNN, run real hardware code in minutes, and understand the architecture that makes it all tick.

Before You Start: Run This Right Now

If you have tt-metal built and your venv activated, you can be running real TTNN code in 60 seconds. No Jupyter, no setup, just Python:

# Activate the tt-metal Python environment
source ~/tt-metal/python_env/bin/activate
export TT_METAL_HOME=~/tt-metal
export PYTHONPATH=$TT_METAL_HOME:$PYTHONPATH

# Run the first tutorial: adds two tensors on TT hardware
python3 ~/tt-metal/ttnn/tutorials/basic_python/ttnn_add_tensors.py

You'll see the device open, the computation run, and the device close. That's real silicon doing real work. The full tutorial collection lives at:

~/tt-metal/ttnn/tutorials/basic_python/

No Jupyter required: every notebook also has a .py companion you can run directly.
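Curious what that first script does? Here's roughly its shape (a minimal sketch, not the tutorial's exact code):

import torch
import ttnn

# Open the device, add two tensors on silicon, read the result back
device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.ones((32, 32)), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.ones((32, 32)), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
c = ttnn.add(a, b)

print(ttnn.to_torch(c)[0, :4])  # expect tensor([2., 2., 2., 2.])
ttnn.close_device(device)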

Don't have ~/tt-metal built yet? Start with Build tt-metal from Source first, then return here.


Why This Hardware is Different

Before diving in, here's what makes Tenstorrent hardware worth exploring:

Wormhole N150 (single chip, 8 TOPS):

Tenstorrent Galaxy (32 Wormhole chips, 256 TOPS):

The same TTNN Python code runs on all of these. You write for N150, scale to Galaxy by changing a device count. That's the architecture advantage this lesson explores.
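To make that concrete, here is a sketch of the scaling story: the single-device and multi-device APIs mirror each other. The mesh API names (ttnn.open_mesh_device, ttnn.MeshShape) are assumed from recent tt-metal versions; verify against your build.

import ttnn

# Single chip (N150): one device handle
device = ttnn.open_device(device_id=0)
# ... your TTNN ops here ...
ttnn.close_device(device)

# Multiple chips (e.g. T3K): the same ops against a mesh handle
# (API names assumed from recent tt-metal; check your version)
mesh = ttnn.open_mesh_device(ttnn.MeshShape(2, 4))  # 2x4 = 8 chips
# ... the same TTNN ops, distributed across the mesh ...
ttnn.close_mesh_device(mesh)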


Part 1: Run the Tutorial Scripts

The Quickest Path: basic_python Scripts

Every TTNN concept has a runnable Python script. These are the best starting point because they don't require Jupyter and have clear, commented code:

cd ~/tt-metal
source python_env/bin/activate

# Tensor basics: create, fill, add on device
python3 ttnn/tutorials/basic_python/ttnn_add_tensors.py

# Core operations: element-wise, reductions, broadcasting
python3 ttnn/tutorials/basic_python/ttnn_basic_operations.py

# Matrix multiplication: the workhorse of neural nets
python3 ttnn/tutorials/basic_python/ttnn_basic_matrix_multiplication.py

# 2D convolution on TT hardware
python3 ttnn/tutorials/basic_python/ttnn_basic_conv.py

# Full inference pipeline: MLP on MNIST
# ⚠️  Train weights first (CPU-only, ~1 min): saves mlp_mnist_weights.pt
python3 ttnn/tutorials/basic_python/train_and_export_mlp.py
python3 ttnn/tutorials/basic_python/ttnn_mlp_inference_mnist.py

# Transformer building block: multi-head attention
python3 ttnn/tutorials/basic_python/ttnn_multihead_attention.py

# CNN inference end-to-end
# ⚠️  Train weights first: saves simplecnn_mnist_weights.pt
python3 ttnn/tutorials/basic_python/train_and_export_cnn.py
python3 ttnn/tutorials/basic_python/ttnn_simplecnn_inference.py

Training step required: ttnn_mlp_inference_mnist.py and ttnn_simplecnn_inference.py load weights from .pt files. Without them the scripts use random weights and report ~20% accuracy. Run the corresponding train_and_export_*.py first (CPU-only, ~1 minute each).

Recommended order: ttnn_add_tensors → ttnn_basic_operations → ttnn_basic_matrix_multiplication → train_and_export_mlp → ttnn_mlp_inference_mnist.

Jupyter Notebooks

If you prefer interactive Jupyter notebooks, the same content is available as .ipynb files under:

~/tt-metal/ttnn/tutorials/

📓 Open TTNN Tutorials

Available notebooks:


Part 2: The Model Zoo – What Runs Today

Tenstorrent's model repository is one of the most extensive collections of hardware-optimized AI models available. Here's what you can run right now:

πŸ” Browse Model Zoo

Production-Ready (models/demos/)

Language Models:

Vision Models:

Audio:

Experimental (models/experimental/)

Hardware-Organized Demos

Models are organized by target hardware for easy discovery:

models/demos/wormhole/   – N150/N300 optimized
models/demos/t3000/      – T3K (8-chip) configurations
models/demos/blackhole/  – P100/P300c (Blackhole)
models/demos/tg/         – Galaxy (32-chip)
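To see what's actually there for your board, just list the matching directory:

# Example for a Wormhole card (adjust the path to your hardware)
ls ~/tt-metal/models/demos/wormhole/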

🎯 What's possible:

  1. Run a 685B parameter model – DeepSeek-V3 on Galaxy
  2. 128K context windows – Read entire books as context
  3. Real-time object detection – YOLO v12 on N150
  4. Train models on device – nanoGPT is buildable from scratch
  5. Native video generation – Mochi and Wan 2.2 (experimental)

Part 3: Understanding the Architecture

The Tensix Core

Each Tenstorrent chip contains a grid of Tensix cores. Understanding their architecture helps you write efficient code.

Inside a Tensix Core:

┌─────────────────────────────────────────────────┐
│                  Tensix Core                    │
├─────────────────────────────────────────────────┤
│                                                 │
│  ┌──────────┐    ┌────────────────┐             │
│  │ 5 RISC-V │───▶│  1.5 MB SRAM   │             │
│  │  "Baby"  │    │  (L1 Memory)   │             │
│  │  CPUs    │    └────────────────┘             │
│  └──────────┘            │                      │
│                          │                      │
│       ┌──────────────────┴───────────┐          │
│       │                              │          │
│  ┌────▼─────┐                  ┌─────▼────┐     │
│  │  Matrix  │                  │  Vector  │     │
│  │  Engine  │                  │  Unit    │     │
│  │  (FPU)   │                  │  (SFPU)  │     │
│  │          │                  │          │     │
│  │  32×32   │                  │ Element- │     │
│  │  Tiles   │                  │   wise   │     │
│  └──────────┘                  └──────────┘     │
│                                                 │
│  ┌─────────────────────────────────────────┐    │
│  │  Network-on-Chip (NoC) - 2 Paths        │    │
│  │  NoC 0: Reads    NoC 1: Writes          │    │
│  └─────────────────────────────────────────┘    │
└─────────────────────────────────────────────────┘
        │                           │
        ▼                           ▼
    DRAM Banks              Other Tensix Cores

Key components:

  1. 5 RISC-V "Baby" CPUs – Control and orchestration; run your kernel code
  2. 1.5 MB L1 SRAM – Fast local memory, explicitly managed (no cache)
  3. Matrix Engine (FPU) – Hardware accelerator for 32×32 tile matmul
  4. Vector Unit (SFPU) – Element-wise ops: ReLU, GELU, Softmax, custom math
  5. Network-on-Chip (NoC) – Two independent paths; connects DRAM and cores
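Because L1 is explicitly managed, tensor placement is something you control from Python. A small sketch: ttnn.DRAM_MEMORY_CONFIG and ttnn.L1_MEMORY_CONFIG are the standard TTNN memory configs, though details may vary across versions.

import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Place a tensor in DRAM (large, off-core), then move it into L1
# (small, fast, inside the Tensix cores). No cache does this for you.
x = ttnn.from_torch(torch.rand((32, 32)), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device,
                    memory_config=ttnn.DRAM_MEMORY_CONFIG)
x_l1 = ttnn.to_memory_config(x, ttnn.L1_MEMORY_CONFIG)
print(x_l1.memory_config())

ttnn.close_device(device)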

Tile-Based Computing

Why 32×32 tiles?

Traditional GPUs process data in linear layouts. Tenstorrent uses 32×32 tiles as the native format because it matches the Matrix Engine hardware perfectly:

import ttnn
import torch

device = ttnn.open_device(device_id=0)

# ROW_MAJOR layout (like NumPy/PyTorch)
row_major = ttnn.from_torch(
    torch.rand((3, 4)),
    layout=ttnn.ROW_MAJOR_LAYOUT,
    device=device
)
print(f"Shape: {row_major.shape}, Padded: {row_major.padded_shape}")
# Output: Shape([3, 4]), Padded: Shape([3, 4])

# TILE_LAYOUT: the native format, padded up to a 32×32 minimum
tile = ttnn.to_layout(row_major, ttnn.TILE_LAYOUT)
print(f"Shape: {tile.shape}, Padded: {tile.padded_shape}")
# Output: Shape([3, 4]), Padded: Shape([32, 32])
# Padding is added automatically to fill a 32×32 tile!

ttnn.close_device(device)

Performance tip: Operations on tile-aligned shapes (multiples of 32) are fastest! Non-aligned shapes work but waste some compute on the padding.
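You can check the tip yourself with a rough micro-benchmark. This is a sketch: it assumes ttnn.synchronize_device is available in your build, and absolute numbers will vary by board and version.

import time
import torch
import ttnn

device = ttnn.open_device(device_id=0)

def bench(shape, iters=10):
    a = ttnn.from_torch(torch.rand(shape), dtype=ttnn.bfloat16,
                        layout=ttnn.TILE_LAYOUT, device=device)
    b = ttnn.from_torch(torch.rand(shape), dtype=ttnn.bfloat16,
                        layout=ttnn.TILE_LAYOUT, device=device)
    ttnn.matmul(a, b)                  # warm-up: first call compiles the op
    ttnn.synchronize_device(device)
    start = time.perf_counter()
    for _ in range(iters):
        ttnn.matmul(a, b)
    ttnn.synchronize_device(device)    # wait for all enqueued work to finish
    return (time.perf_counter() - start) / iters

print(f"1024×1024 (tile-aligned): {bench((1024, 1024)) * 1e3:.2f} ms")
print(f"1000×1000 (padded):       {bench((1000, 1000)) * 1e3:.2f} ms")

ttnn.close_device(device)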


The Three-Kernel Programming Model

Most operations use three kernels working together in a pipeline:

     Reader Kernel             Compute Kernel            Writer Kernel
     (Data Movement)          (Math Operations)         (Data Movement)
            │                         │                        │
┌───────────▼──────────┐   ┌──────────▼─────────┐   ┌──────────▼─────────┐
│  Fetch from DRAM     │──▶│  Process in SRAM   │──▶│  Store to DRAM     │
│  via NoC 0           │   │  (Matrix/Vector)   │   │  via NoC 1         │
└──────────────────────┘   └────────────────────┘   └────────────────────┘

Circular Buffers in L1 SRAM enable pipelining:
- Reader fills buffer while Compute processes previous batch
- Compute fills output buffer while Writer stores previous batch

This architecture means there is no hidden cache thrashing; every data movement is explicit. That's why profiling Metalium programs is precise: you know exactly what's moving where.
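The pipeline is easy to picture with an ordinary producer/consumer simulation. This is plain Python standing in for the three kernels (illustration only, not the Metalium API); the bounded queues play the role of circular buffers:

from queue import Queue
from threading import Thread

DONE = object()  # sentinel marking end-of-stream

def reader(cb_in, tiles):
    for t in tiles:                # NoC 0: fetch tiles from "DRAM"
        cb_in.put(t)
    cb_in.put(DONE)

def compute(cb_in, cb_out):
    while (t := cb_in.get()) is not DONE:
        cb_out.put(t * 2)          # stand-in for FPU/SFPU math in "SRAM"
    cb_out.put(DONE)

def writer(cb_out, results):
    while (t := cb_out.get()) is not DONE:
        results.append(t)          # NoC 1: store results back to "DRAM"

# maxsize=2 models double buffering: each stage works on one item
# while the next buffer slot is being filled by its neighbor
cb_in, cb_out, results = Queue(maxsize=2), Queue(maxsize=2), []
threads = [Thread(target=reader,  args=(cb_in, range(8))),
           Thread(target=compute, args=(cb_in, cb_out)),
           Thread(target=writer,  args=(cb_out, results))]
for th in threads: th.start()
for th in threads: th.join()
print(results)  # [0, 2, 4, 6, 8, 10, 12, 14]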


Two Levels of Abstraction

TTNN (Python) – High Level:

import ttnn

device = ttnn.open_device(device_id=0)

a = ttnn.rand((32, 32), device=device, layout=ttnn.TILE_LAYOUT)
b = ttnn.rand((32, 32), device=device, layout=ttnn.TILE_LAYOUT)

c = ttnn.matmul(a, b)   # Matrix multiply
d = ttnn.add(c, 1.0)    # Add scalar
e = ttnn.gelu(d)        # Activation

result = ttnn.to_torch(e)
ttnn.close_device(device)

Use TTNN for: rapid prototyping, standard model inference, Python-first development.


TT-Metalium (C++) – Low Level:

#include "tt_metal/host_api.hpp"

using namespace tt::tt_metal;

int main() {
    Device* device = CreateDevice(0);
    CommandQueue& command_queue = device->command_queue();
    Program program = CreateProgram();
    CoreCoord core = {0, 0};  // target a single Tensix core

    // Define reader, compute, and writer kernels
    auto reader = CreateKernel(program, "kernels/reader.cpp", core,
                               DataMovementConfig{...});
    auto compute = CreateKernel(program, "kernels/compute.cpp", core,
                                ComputeConfig{...});

    EnqueueProgram(command_queue, program, false);
    Finish(command_queue);
    CloseDevice(device);
}

Use TT-Metalium for: maximum performance, custom operations, novel algorithms, research.


Part 4: Programming Examples

Build and Run Examples

The programming examples demonstrate Metalium kernels from hello world through multi-core matrix multiply. Build them with:

cd ~/tt-metal
./build_metal.sh --build-programming-examples

This takes an additional 5–10 minutes but gives you standalone executables.

Beginner:

Example                      What It Teaches
Hello World Compute          Your first compute kernel
Hello World Data Movement    Your first reader/writer kernel
Add 2 Integers               Basic arithmetic on device
DRAM Loopback                Buffer creation, data movement

# Run after building with --build-programming-examples
./build/programming_examples/hello_world_compute_kernel
./build/programming_examples/hello_world_datamovement_kernel
./build/programming_examples/add_2_integers_in_compute

Intermediate:

Example               What It Teaches
Eltwise Binary        Element-wise ops with circular buffers
Eltwise SFPU          Vector operations (SFPU math)
Matmul Single Core    Using the matrix engine
Matmul Multi Core     Parallel execution across cores
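These also build to standalone binaries. The names below assume the current tt-metal layout; check build/programming_examples/ for the exact names in your checkout.

# Run after building with --build-programming-examples
./build/programming_examples/eltwise_binary
./build/programming_examples/eltwise_sfpu
./build/programming_examples/matmul_single_core
./build/programming_examples/matmul_multi_core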

Hands-On: Tile Padding Experiment

Run this short script to see how TTNN handles the 32×32 tile requirement:

cat > /tmp/tile_experiment.py << 'EOF'
import ttnn
import torch

device = ttnn.open_device(device_id=0)

cases = [(5, 5), (100, 50), (128, 128), (1024, 1024)]

for shape in cases:
    t = ttnn.from_torch(
        torch.rand(shape),
        layout=ttnn.TILE_LAYOUT,
        device=device
    )
    pad_r = t.padded_shape[-2] - shape[0]
    pad_c = t.padded_shape[-1] - shape[1]
    print(f"{shape[0]:5}Γ—{shape[1]:<5}  β†’  padded {t.padded_shape[-2]}Γ—{t.padded_shape[-1]}  "
          f"(wasted: {pad_r * t.padded_shape[-1] + pad_c * shape[0]} elements)")

ttnn.close_device(device)
print("\nRule: dimensions always pad to next multiple of 32.")
print("For best performance, design your model shapes to be multiples of 32.")
EOF
cd ~/tt-metal && python3 /tmp/tile_experiment.py

Observe:

- 5×5 pads to 32×32: 999 of 1024 elements (about 97%) are padding
- 100×50 pads to 128×64: 3,192 wasted elements
- 128×128 and 1024×1024 are already tile-aligned: zero waste


Key Takeaways

- Real TTNN code runs on hardware in minutes: activate the venv and run the basic_python scripts
- The native data format is the 32×32 tile; shapes that are multiples of 32 avoid wasting compute on padding
- Each Tensix core pairs five RISC-V CPUs with a matrix engine (FPU), a vector unit (SFPU), 1.5 MB of explicitly managed L1 SRAM, and two NoC paths
- Most operations follow the reader/compute/writer three-kernel pipeline; all data movement is explicit
- TTNN (Python) is for fast, portable development; TT-Metalium (C++) is for maximum control and performance
- The same TTNN code scales from a single N150 to a 32-chip Galaxy


What's Next?

In the Metalium Cookbook, you'll apply these concepts by building four creative projects:

  1. Conway's Game of Life – Cellular automata with parallel tile computing
  2. Audio Processor – Real-time mel-spectrogram and effects
  3. Mandelbrot Explorer – GPU-style fractal rendering
  4. Custom Image Filters – Creative visual effects

🚀 Continue to JAX Inference with TT-XLA


Resources


Troubleshooting

ttnn.open_device() fails:

tt-smi    # Check device status
tt-smi -r # Reset if showing errors

Jupyter notebooks won't open:

code --install-extension ms-toolsai.jupyter

Out of memory:

Deallocate intermediate device tensors you no longer need (ttnn.deallocate(tensor)) and try smaller batch sizes; remember each Tensix core has only 1.5 MB of L1.

Slow performance:

Keep tensor dimensions at multiples of 32 so no compute is wasted on padding, prefer TILE_LAYOUT, and avoid host/device round trips (ttnn.to_torch / ttnn.from_torch) inside hot loops.