Exploring the TT-Metalium Playground
Welcome to the heart of Tenstorrent development! In this lesson you'll discover what's possible with TT-Metalium and TTNN, run real hardware code in minutes, and understand the architecture that makes it all tick.
What You'll Do
- Run your first TTNN operation on TT hardware in five lines of code
- Understand tile-based computing and the Tensix core
- Explore the three-kernel programming model
- Browse the model zoo and Jupyter tutorials
- See the path from TTNN (high-level Python) to TT-Metalium (custom C++ kernels)
Before You Start: Run This Right Now
If you have tt-metal built and your venv activated, you can be running real TTNN code in 60 seconds. No Jupyter, no extra setup, just Python:
# Activate the tt-metal Python environment
source ~/tt-metal/python_env/bin/activate
export TT_METAL_HOME=~/tt-metal
export PYTHONPATH=$TT_METAL_HOME:$PYTHONPATH
# Run the first tutorial: adds two tensors on TT hardware
python3 ~/tt-metal/ttnn/tutorials/basic_python/ttnn_add_tensors.py
You'll see the device open, the computation run, and the device close. That's real silicon doing real work. The full tutorial collection lives at:
~/tt-metal/ttnn/tutorials/basic_python/
No Jupyter required: every notebook also has a .py companion you can run directly.
Don't have ~/tt-metal built yet? Start with Build tt-metal from Source first, then return here.
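For a sense of what that first script boils down to, here is a minimal sketch of adding two tensors with TTNN. It uses the public ttnn calls (from_torch, add, to_torch); the actual tutorial file differs in its details:

import torch
import ttnn

# Open the first Tenstorrent device (assumes one is attached and tt-metal is built)
device = ttnn.open_device(device_id=0)

# Move two host tensors to the device in the native 32x32 tile layout
a = ttnn.from_torch(torch.rand((32, 32)), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.rand((32, 32)), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# The add runs on the Tensix cores; the result comes back as a torch tensor
c = ttnn.add(a, b)
print(ttnn.to_torch(c))

ttnn.close_device(device)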
Why This Hardware is Different
Before diving in, here's what makes Tenstorrent hardware worth exploring:
Wormhole N150 (single chip, 8 TOPS):
- Runs Llama 3.1 8B at ~20 tok/s
- Generates 512×512 images in ~30s with Stable Diffusion
- Runs BERT-Large inference at ~400 sentences/sec
Tenstorrent Galaxy (32 Wormhole chips, 256 TOPS):
- Runs DeepSeek-V3 (685B parameters) in production
- Stable Diffusion 3.5 Large in 5.6 seconds per image
- Llama 3 70B at hundreds of tok/s
The same TTNN Python code runs on all of these: you write for an N150 and scale to a Galaxy by changing the device count. That's the architecture advantage this lesson explores.
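At the API level the scaling story looks roughly like this: a single-chip script opens one device, while a multi-chip system is opened as a mesh and the same ops run across it. A hedged sketch, assuming ttnn.open_mesh_device and ttnn.MeshShape are available in your tt-metal build (the mesh shape here is illustrative):

import ttnn

# Single chip (e.g., an N150)
device = ttnn.open_device(device_id=0)
# ... run your TTNN ops ...
ttnn.close_device(device)

# Multi-chip system (e.g., a T3K or Galaxy): open the chips as a mesh instead.
# The same ttnn ops work with mesh-resident tensors; only the device handle changes.
mesh_device = ttnn.open_mesh_device(ttnn.MeshShape(2, 4))  # 8 chips, illustrative
# ... run the same TTNN ops ...
ttnn.close_mesh_device(mesh_device)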
Part 1: Run the Tutorial Scripts
The Quickest Path: basic_python Scripts
Every TTNN concept has a runnable Python script. These are the best starting point because they don't require Jupyter and have clear, commented code:
cd ~/tt-metal
source python_env/bin/activate
# Tensor basics: create, fill, add on device
python3 ttnn/tutorials/basic_python/ttnn_add_tensors.py
# Core operations: element-wise, reductions, broadcasting
python3 ttnn/tutorials/basic_python/ttnn_basic_operations.py
# Matrix multiplication: the workhorse of neural nets
python3 ttnn/tutorials/basic_python/ttnn_basic_matrix_multiplication.py
# 2D convolution on TT hardware
python3 ttnn/tutorials/basic_python/ttnn_basic_conv.py
# Full inference pipeline: MLP on MNIST
# NOTE: Train weights first (CPU-only, ~1 min): saves mlp_mnist_weights.pt
python3 ttnn/tutorials/basic_python/train_and_export_mlp.py
python3 ttnn/tutorials/basic_python/ttnn_mlp_inference_mnist.py
# Transformer building block: multi-head attention
python3 ttnn/tutorials/basic_python/ttnn_multihead_attention.py
# CNN inference end-to-end
# NOTE: Train weights first: saves simplecnn_mnist_weights.pt
python3 ttnn/tutorials/basic_python/train_and_export_cnn.py
python3 ttnn/tutorials/basic_python/ttnn_simplecnn_inference.py
Training step required:
ttnn_mlp_inference_mnist.py and ttnn_simplecnn_inference.py load weights from .pt files. Without them the scripts use random weights and report ~20% accuracy. Run the corresponding train_and_export_*.py first (CPU-only, ~1 minute each); the general weight-loading pattern is sketched below the recommended order.
Recommended order: ttnn_add_tensors → ttnn_basic_operations → ttnn_basic_matrix_multiplication → train_and_export_mlp → ttnn_mlp_inference_mnist.
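As noted above, the inference scripts load their weights from .pt files. The general pattern is to load the state dict with torch and move each weight tensor to the device with ttnn.from_torch. A simplified sketch of that pattern (the weight file name matches the tutorial; the layer key "fc1.weight" is hypothetical):

import torch
import ttnn

device = ttnn.open_device(device_id=0)

# Weights produced by train_and_export_mlp.py (CPU training, ~1 minute)
state_dict = torch.load("mlp_mnist_weights.pt")

# Move each weight tensor to the device; "fc1.weight" is a hypothetical key
w1 = ttnn.from_torch(state_dict["fc1.weight"], dtype=ttnn.bfloat16,
                     layout=ttnn.TILE_LAYOUT, device=device)
# ... the script then runs its matmul/linear ops against these device-resident weights ...

ttnn.close_device(device)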
Jupyter Notebooks
If you prefer interactive Jupyter notebooks, the same content is available as .ipynb
files in the same directory:
~/tt-metal/ttnn/tutorials/
Available notebooks:
- ttnn_intro.ipynb: Introduction to TTNN concepts
- ttnn_add_tensors.ipynb: Tensor creation and addition
- ttnn_basic_operations.ipynb: Element-wise ops, reductions
- ttnn_basic_matrix_multiplication.ipynb: matmul deep dive
- ttnn_basic_conv.ipynb: 2D convolution fundamentals
- ttnn_mlp_inference_mnist.ipynb: Complete inference pipeline
- ttnn_multihead_attention.ipynb: Transformer building block
- ttnn_simplecnn_inference.ipynb: End-to-end CNN example
- ttnn_clip_zero_shot_classification.ipynb: CLIP model inference
Part 2: The Model Zoo - What Runs Today
Tenstorrent's model repository is one of the most extensive collections of hardware-optimized AI models available. Here's what you can run right now:
Production-Ready (models/demos/)
Language Models:
- Llama 3.1 8B: Chat, code, reasoning (N150/N300)
- Llama 3 70B: Large-scale inference (Galaxy, 32 chips)
- DeepSeek-V3: State-of-the-art reasoning (Galaxy)
- Gemma 3 27B: Multimodal text+image, 128K context (N300/T3K)
- Qwen 2.5 VL: Vision-language understanding
Vision Models:
- Stable Diffusion 1.4: Text-to-image (N150/N300/P100)
- YOLO v10/v11/v12: Real-time object detection
- SegFormer: Semantic segmentation
- SigLIP: Image-text matching
- ResNet50, MobileNetV2: Image classification at speed
- BERT, DistilBERT: NLP understanding
Audio:
- Whisper: Speech-to-text transcription
Experimental (models/experimental/)
- Stable Diffusion 3.5 Large: via tt-dit (Galaxy/QuietBox, 8+ chips)
- Flux 1: Text-to-image generation
- Mochi-1: Native video generation
- Wan 2.2: Text-to-video model
- nanoGPT: Train a GPT from scratch on device
- Grok: xAI reasoning model port
Hardware-Organized Demos
Models are organized by target hardware for easy discovery:
models/demos/wormhole/   - N150/N300 optimized
models/demos/t3000/      - T3K (8-chip) configurations
models/demos/blackhole/  - P100/P300c (Blackhole)
models/demos/tg/         - Galaxy (32-chip)
What's possible:
- Run a 685B-parameter model: DeepSeek-V3 on Galaxy
- 128K context windows: read entire books as context
- Real-time object detection: YOLO v12 on N150
- Train models on device: nanoGPT is buildable from scratch
- Native video generation: Mochi and Wan 2.2 (experimental)
Part 3: Understanding the Architecture
The Tensix Core
Each Tenstorrent chip contains a grid of Tensix cores. Understanding their architecture helps you write efficient code.
Inside a Tensix Core:
+---------------------------------------------------+
|                    Tensix Core                    |
+---------------------------------------------------+
|                                                   |
|  +-----------+         +------------------+       |
|  | 5 RISC-V  |-------->|   1.5 MB SRAM    |       |
|  |  "Baby"   |         |   (L1 Memory)    |       |
|  |   CPUs    |         +------------------+       |
|  +-----------+                  |                 |
|                                 |                 |
|              +------------------+---------+       |
|              |                            |       |
|        +-----v------+              +------v-----+ |
|        |   Matrix   |              |   Vector   | |
|        |   Engine   |              |    Unit    | |
|        |   (FPU)    |              |   (SFPU)   | |
|        |            |              |            | |
|        |   32x32    |              |  Element-  | |
|        |   Tiles    |              |    wise    | |
|        +------------+              +------------+ |
|                                                   |
|  +---------------------------------------------+  |
|  |      Network-on-Chip (NoC) - 2 Paths        |  |
|  |      NoC 0: Reads        NoC 1: Writes      |  |
|  +---------------------------------------------+  |
+---------------------------------------------------+
         |                          |
         v                          v
    DRAM Banks              Other Tensix Cores
Key components:
- 5 RISC-V "Baby" CPUs: Control and orchestration; run your kernel code
- 1.5 MB L1 SRAM: Fast local memory, explicitly managed (no cache)
- Matrix Engine (FPU): Hardware accelerator for 32×32 tile matmul
- Vector Unit (SFPU): Element-wise ops such as ReLU, GELU, Softmax, and custom math
- Network-on-Chip (NoC): Two independent paths; connects DRAM and cores
Tile-Based Computing
Why 32×32 tiles?
Traditional GPUs process data in linear layouts. Tenstorrent uses 32×32 tiles as the native format because it matches the Matrix Engine hardware exactly:
import ttnn
import torch
device = ttnn.open_device(device_id=0)
# ROW_MAJOR layout (like NumPy/PyTorch)
row_major = ttnn.from_torch(
torch.rand((3, 4)),
layout=ttnn.ROW_MAJOR_LAYOUT,
device=device
)
print(f"Shape: {row_major.shape}, Padded: {row_major.padded_shape}")
# Output: Shape([3, 4]), Padded: Shape([3, 4])
# TILE_LAYOUT: native format, padded to a 32×32 minimum
tile = ttnn.to_layout(row_major, ttnn.TILE_LAYOUT)
print(f"Shape: {tile.shape}, Padded: {tile.padded_shape}")
# Output: Shape([3, 4]), Padded: Shape([32, 32])
# Padding added automatically to fill a 32×32 tile!
ttnn.close_device(device)
Performance tip: Operations on tile-aligned shapes (multiples of 32) are fastest! Non-aligned shapes work but waste some compute on the padding.
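If your data is not naturally tile-aligned, one option is to pad it yourself on the host before moving it to the device, so the padding is explicit in your code. A minimal, purely illustrative sketch using torch's pad (pad_to_tile is a hypothetical helper, not part of ttnn):

import torch
import torch.nn.functional as F

def pad_to_tile(t: torch.Tensor, tile: int = 32) -> torch.Tensor:
    """Zero-pad the last two dims of t up to the next multiple of `tile`."""
    rows, cols = t.shape[-2], t.shape[-1]
    pad_rows = (-rows) % tile
    pad_cols = (-cols) % tile
    # F.pad takes (left, right, top, bottom) for the last two dimensions
    return F.pad(t, (0, pad_cols, 0, pad_rows))

x = torch.rand((100, 50))
print(pad_to_tile(x).shape)  # torch.Size([128, 64])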
The Three-Kernel Programming Model
Most operations use three kernels working together in a pipeline:
  Reader Kernel           Compute Kernel          Writer Kernel
 (Data Movement)         (Math Operations)       (Data Movement)
        |                        |                       |
+-------v---------+     +--------v--------+     +--------v--------+
| Fetch from DRAM |---->| Process in SRAM |---->|  Store to DRAM  |
|    via NoC 0    |     | (Matrix/Vector) |     |    via NoC 1    |
+-----------------+     +-----------------+     +-----------------+
Circular Buffers in L1 SRAM enable pipelining:
- Reader fills buffer while Compute processes previous batch
- Compute fills output buffer while Writer stores previous batch
This architecture means there is no hidden cache thrashing: every data movement is explicit. That's why profiling Metalium programs is precise: you know exactly what's moving where.
Two Levels of Abstraction
TTNN (Python) - High Level:
import ttnn
device = ttnn.open_device(device_id=0)
a = ttnn.rand((32, 32), device=device, layout=ttnn.TILE_LAYOUT)
b = ttnn.rand((32, 32), device=device, layout=ttnn.TILE_LAYOUT)
c = ttnn.matmul(a, b) # Matrix multiply
d = ttnn.add(c, 1.0) # Add scalar
e = ttnn.gelu(d) # Activation
result = ttnn.to_torch(e)
ttnn.close_device(device)
Use TTNN for: rapid prototyping, standard model inference, Python-first development.
TT-Metalium (C++) - Low Level:
#include "tt_metal/host_api.hpp"
using namespace tt::tt_metal;
int main() {
    Device* device = CreateDevice(0);
    Program program = CreateProgram();
    CommandQueue& command_queue = device->command_queue();
    CoreCoord core = {0, 0};  // run on a single Tensix core

    // Define reader, compute, and writer kernels
    auto reader = CreateKernel(program, "kernels/reader.cpp", core,
                               DataMovementConfig{...});
    auto compute = CreateKernel(program, "kernels/compute.cpp", core,
                                ComputeConfig{...});

    EnqueueProgram(command_queue, program, false);
    Finish(command_queue);
    CloseDevice(device);
}
Use TT-Metalium for: maximum performance, custom operations, novel algorithms, research.
Part 4: Programming Examples
Build and Run Examples
The programming examples demonstrate Metalium kernels from hello world through multi-core matrix multiply. Build them with:
cd ~/tt-metal
./build_metal.sh --build-programming-examples
This takes an additional 5-10 minutes but gives you standalone executables.
Beginner:
| Example | What It Teaches |
|---|---|
| Hello World Compute | Your first compute kernel |
| Hello World Data Movement | Your first reader/writer kernel |
| Add 2 Integers | Basic arithmetic on device |
| DRAM Loopback | Buffer creation, data movement |
# Run after building with --build-programming-examples
./build/programming_examples/hello_world_compute_kernel
./build/programming_examples/hello_world_datamovement_kernel
./build/programming_examples/add_2_integers_in_compute
Intermediate:
| Example | What It Teaches |
|---|---|
| Eltwise Binary | Element-wise ops with circular buffers |
| Eltwise SFPU | Vector operations (SFPU math) |
| Matmul Single Core | Using the matrix engine |
| Matmul Multi Core | Parallel execution across cores |
Hands-On: Tile Padding Experiment
Run this short script to see how TTNN handles the 32×32 tile requirement:
cat > /tmp/tile_experiment.py << 'EOF'
import ttnn
import torch
device = ttnn.open_device(device_id=0)
cases = [(5, 5), (100, 50), (128, 128), (1024, 1024)]
for shape in cases:
    t = ttnn.from_torch(
        torch.rand(shape),
        layout=ttnn.TILE_LAYOUT,
        device=device
    )
    pad_r = t.padded_shape[-2] - shape[0]
    pad_c = t.padded_shape[-1] - shape[1]
    print(f"{shape[0]:5}×{shape[1]:<5} -> padded {t.padded_shape[-2]}×{t.padded_shape[-1]} "
          f"(wasted: {pad_r * t.padded_shape[-1] + pad_c * shape[0]} elements)")
ttnn.close_device(device)
print("\nRule: dimensions always pad to next multiple of 32.")
print("For best performance, design your model shapes to be multiples of 32.")
EOF
cd ~/tt-metal && python3 /tmp/tile_experiment.py
Observe:
- How much padding each shape requires
- Why 128×128 and 1024×1024 are "free" (already tile-aligned)
- What the padding cost is for 5×5 (the padded tile holds over 40× the original data; see the host-side check below)
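The padding rule itself is simple enough to check on the host: each of the last two dimensions rounds up to the next multiple of 32. A quick pure-Python verification of the shapes above:

import math

def padded_dim(n: int, tile: int = 32) -> int:
    # Round n up to the next multiple of the tile size
    return math.ceil(n / tile) * tile

for shape in [(5, 5), (100, 50), (128, 128), (1024, 1024)]:
    padded = (padded_dim(shape[0]), padded_dim(shape[1]))
    waste = padded[0] * padded[1] - shape[0] * shape[1]
    print(f"{shape} -> {padded}, wasted elements: {waste}")
# (5, 5) -> (32, 32), wasted elements: 999
# (100, 50) -> (128, 64), wasted elements: 3192
# (128, 128) -> (128, 128), wasted elements: 0
# (1024, 1024) -> (1024, 1024), wasted elements: 0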
Key Takeaways
- TTNN runs on every Tenstorrent chip: write once, scale from N150 to Galaxy
- Tile-based computing (32×32) is the native format: align your shapes!
- The three-kernel model (Reader -> Compute -> Writer) enables pipelined execution
- Explicit memory (L1 SRAM) instead of caches means predictable performance
- Production models exist for LLMs, vision, audio, video, and more
- Both levels matter: TTNN for productivity, Metalium for maximum performance
What's Next?
In the Metalium Cookbook, you'll apply these concepts building four creative projects:
- Conway's Game of Life: Cellular automata with parallel tile computing
- Audio Processor: Real-time mel-spectrogram and effects
- Mandelbrot Explorer: GPU-style fractal rendering
- Custom Image Filters: Creative visual effects
Continue to JAX Inference with TT-XLA
Resources
- METALIUM_GUIDE.md: ~/tt-metal/METALIUM_GUIDE.md (architecture deep-dive)
- Tutorial scripts: ~/tt-metal/ttnn/tutorials/basic_python/ (runnable Python files)
- Jupyter notebooks: ~/tt-metal/ttnn/tutorials/ (interactive notebooks)
- Programming examples: ~/tt-metal/tt_metal/programming_examples/
- Tech reports: ~/tt-metal/tech_reports/ (Flash Attention, architecture papers)
- Official docs: docs.tenstorrent.com
- Discord: discord.gg/tvhGzHQwaj
Troubleshooting
ttnn.open_device() fails:
tt-smi # Check device status
tt-smi -r # Reset if showing errors
Jupyter notebooks won't open:
code --install-extension ms-toolsai.jupyter
Out of memory:
- Reduce batch sizes
- Use tile-aligned dimensions (multiples of 32)
- Release tensors you no longer need with ttnn.deallocate(tensor)
Slow performance:
- Non-tile-aligned shapes add padding overhead; use multiples of 32
- Minimize to_torch()/from_torch() round-trips
- Always set layout=ttnn.TILE_LAYOUT for compute-intensive ops (see the sketch below)
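Putting the last two tips together: keep intermediate results on the device between ops instead of converting back to torch at each step, and free large intermediates you no longer need. A small sketch (shapes are illustrative):

import torch
import ttnn

device = ttnn.open_device(device_id=0)

x = ttnn.from_torch(torch.rand((1024, 1024)), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
w = ttnn.from_torch(torch.rand((1024, 1024)), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# Chain ops on device; convert to torch only once at the end
y = ttnn.matmul(x, w)
z = ttnn.gelu(y)
ttnn.deallocate(y)          # free the intermediate once it is no longer needed
result = ttnn.to_torch(z)

ttnn.close_device(device)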