The TT-Metal Architecture

Before you write a single line of kernel code, you should understand what you’re writing it for. The Blackhole chip is not a GPU wearing a different nametag. The memory model is different. The execution model is different. The abstraction layers are deliberately transparent. Once you see the architecture clearly, the API choices stop being arbitrary and start being obvious.

The Stack From Top to Bottom

Four layers sit between your Python and the chip. Each layer is real and each layer compiles:

TT-Lang        →  Python DSL, looks like Python, compiles to assembly
TTNN           →  Python ops, tensor API, calls into Metalium
TT-Metalium    →  C++ kernel API, explicit data movement, JIT compile
Kernel Driver  →  firmware, PCIe dispatch, ring buffers

You can enter this stack at any level. TTNN is the right entry point for standard ops. TT-Lang is the right entry point when you need a custom pattern and want AI-assisted development. Metalium is where you go when the abstraction has to disappear.

Blackhole Grid Anatomy

The Blackhole chip is a 17-column by 12-row network-on-chip (NoC) grid. Every cell in that grid is a node. Not every node is a compute core. The grid has four distinct zones:

Tensix cores — columns 1-7 and 9-15, rows 1-10. One hundred and forty physical tiles, of which 120 are enabled on QB2’s chips (two columns are harvested). These are the compute nodes. Each Tensix core is itself a small computer.

DRAM controllers — rows 0 and 11, running the full width of the chip. 32 GB of GDDR6 per chip (64 GB per p300c card). The chip’s main memory lives here, physically along the chip edges, close to the NoC’s routing paths.

ETH ports — column 0 and column 16. These connect chips together. On a QB2’s four Blackhole chips, the ETH ports form the chip-to-chip fabric used by CreateDevices when you open a multi-chip mesh.

PCIe interface — column 8, the center column. Every command from your Python application crosses here. ttnn.open_device(0) sends a dispatch message through this column.

Figure: Tensix Grid, Blackhole (P100/P150/P300c / QB2)

One Blackhole chip. Four of these — on two p300c cards — live in your QB2.
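The zone layout above can be written down as a small classifier. This is a sketch of the grid as this section describes it, not a query against real hardware. The function name `zone` is hypothetical, and the corner handling is an assumption: where a DRAM row meets an ETH column, the node is counted as DRAM, because the text does not say which zone wins there.

```python
from collections import Counter

COLS, ROWS = 17, 12  # columns 0-16, rows 0-11, as described above


def zone(col: int, row: int) -> str:
    """Classify one NoC node by the zone layout described in this section."""
    if not (0 <= col < COLS and 0 <= row < ROWS):
        raise ValueError(f"node ({col}, {row}) is outside the {COLS}x{ROWS} grid")
    if row in (0, ROWS - 1):
        return "dram"    # rows 0 and 11, full width (assumed to win at corners)
    if col in (0, COLS - 1):
        return "eth"     # columns 0 and 16
    if col == 8:
        return "pcie"    # the center column
    return "tensix"      # columns 1-7 and 9-15, rows 1-10


# Tally every node in the grid by zone.
counts = Counter(zone(c, r) for c in range(COLS) for r in range(ROWS))
print(counts["tensix"])  # 14 columns x 10 rows = 140 physical Tensix tiles
```

The tally reproduces the 140 physical Tensix tiles quoted above. Harvesting (the 120 enabled cores) is not modeled, since which columns are disabled is not stated here.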

Inside a Tensix Core

Zoom in on any one of those Tensix nodes. Each Tensix core contains:

RISC-V control processors — five small general-purpose CPUs (including the BRISC and NCRISC data-movement processors named below) that execute your kernel logic
Matrix engine (FPU) — hardware-accelerated matrix multiply and elementwise ops; this is what makes it fast
Register tile files — SRCA, SRCB, and DST registers that hold 32×32 element tiles during computation
L1 SRAM — fast on-core scratchpad memory; your kernel reads data here before the FPU touches it
Two NoC endpoints — one for reads (inbound), one for writes (outbound); both can operate independently and concurrently

The L1 SRAM is crucial. Moving data from DRAM to a Tensix core’s L1 is an explicit operation you control. Nothing is cached automatically. This sounds like a burden and becomes a superpower: you know exactly where every byte is.

The Three-Kernel Model

Every Metalium operation on a Tensix core involves three co-running kernels. All three run on the same core, concurrently:

Data-movement-reader (typically NCRISC) — reads tiles from DRAM or another core’s L1 into this core’s L1 via the read NoC endpoint
Compute — pops tiles from L1 into the SRCA/SRCB registers, runs the matrix engine, writes results to DST, pushes results back to L1
Data-movement-writer (typically BRISC) — takes finished tiles from L1 and sends them to DRAM or another core’s L1 via the write NoC endpoint. The reader/writer assignment is a convention: either data-movement processor can fill either role

🔬 Why three kernels? The answer is overlap. In a GPU kernel that does not overlap its loads with its math, compute waits for data to arrive, then the write-back waits for compute to finish. On a Tensix core, the reader can be pulling the next tile from DRAM while the FPU is processing the current tile, while the writer is sending the previous tile downstream. Three pipelines, one core, and ideally no idle cycles in the steady state: throughput is then set by the slowest of the three stages. This is what makes utilization numbers look so different from GPU profiles.
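The overlap argument can be checked with a toy timing model. This is a minimal sketch, not a hardware model: the function names and the per-tile stage costs are hypothetical, the time units are arbitrary, and it assumes the buffers between stages never fill up, so a stage only ever waits for its input.

```python
def serial_time(n_tiles: int, read: int, compute: int, write: int) -> int:
    """Total time if every tile is read, computed, and written back to back."""
    return n_tiles * (read + compute + write)


def pipelined_time(n_tiles: int, read: int, compute: int, write: int) -> int:
    """Total time if reader, compute, and writer run concurrently.

    A stage starts a tile once the previous stage has delivered that tile
    AND the stage itself has finished the tile before it.
    """
    stage_costs = (read, compute, write)
    done = [0] * len(stage_costs)  # finish time of each stage's latest tile
    for _ in range(n_tiles):
        ready = 0  # time at which the upstream stage hands this tile over
        for s, cost in enumerate(stage_costs):
            start = max(ready, done[s])
            done[s] = start + cost
            ready = done[s]
    return done[-1]


# Hypothetical costs: 3 units to read, 5 to compute, 2 to write, for 8 tiles.
print(serial_time(8, 3, 5, 2))     # 80
print(pipelined_time(8, 3, 5, 2))  # 45
```

In this model the pipelined total is one full pass through all three stages plus one slowest-stage interval per additional tile (10 + 7 × 5 = 45), which is why balancing the three kernels matters more than speeding up any single one.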

Tiles: The Native Unit

TTNN doesn’t think in terms of individual floats or rows. It thinks in 32×32 tiles. A tensor of shape (64, 64) becomes 4 tiles of shape (32, 32), arranged as a 2×2 grid. The tile format (BFP8, BF16, or FP32) is set when you create a tensor:

import ttnn, torch

device = ttnn.open_device(device_id=0)

# Create a tensor — TTNN tiles it automatically on device transfer
t = torch.randn(64, 64)
t_tt = ttnn.from_torch(t, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)

# t_tt is now four 32x32 BF16 tiles distributed in the chip's DRAM
print(t_tt.shape)   # the logical shape, 64 x 64 (a ttnn shape object, not torch.Size)
print(t_tt.dtype)   # the ttnn bfloat16 data type

ttnn.close_device(device)

The 32×32 tile size is not adjustable — it is the hardware’s register file size. Every operation on the matrix engine processes one tile at a time. Kernels are written to process tiles, readers fetch tiles, writers send tiles.
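The tile arithmetic is simple enough to sketch in plain Python. The helper names below are hypothetical and none of this calls TTNN. Two assumptions are baked in: dimensions that are not multiples of 32 are rounded up to whole tiles, and `locate` gives an element's logical position inside its tile, ignoring whatever storage order the hardware uses within the tile.

```python
import math

TILE = 32  # tile edge length in elements, per the text above

# Bytes per element for the two uncompressed formats used in this section.
BYTES_PER_ELEMENT = {"bf16": 2, "fp32": 4}


def tile_grid(rows: int, cols: int) -> tuple[int, int]:
    """Number of tiles along each dimension, rounding partial tiles up."""
    return math.ceil(rows / TILE), math.ceil(cols / TILE)


def tile_count(rows: int, cols: int) -> int:
    """Total number of 32x32 tiles needed to cover a (rows, cols) tensor."""
    tile_rows, tile_cols = tile_grid(rows, cols)
    return tile_rows * tile_cols


def locate(row: int, col: int) -> tuple[tuple[int, int], tuple[int, int]]:
    """Map an element to (tile index, logical position inside that tile)."""
    return (row // TILE, col // TILE), (row % TILE, col % TILE)


def tile_bytes(fmt: str) -> int:
    """Payload size of one tile in the given format."""
    return TILE * TILE * BYTES_PER_ELEMENT[fmt]


print(tile_count(64, 64))   # 4
print(locate(40, 5))        # ((1, 0), (8, 5))
print(tile_bytes("bf16"))   # 2048
```

A (64, 64) BF16 tensor is therefore 4 tiles of 2048 bytes each, 8 KiB in total, which is the granularity at which readers and writers move data.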

The NoC Fabric

The two-dimensional mesh NoC lets any core read from or write to any other core’s L1, or any DRAM bank, by address. There is no coherence protocol, no cache hierarchy. You own the data movement. The routing is deterministic and the bandwidth is high — but contention is possible, which is why the profiler shows per-link NoC traffic.

For a single-chip operation, you’re moving tiles from DRAM row-0 or row-11 nodes, across the mesh, to your compute cores’ L1. For a multi-chip operation via CreateDevices, tiles cross the ETH columns at the chip edges and appear at another chip’s ETH columns before continuing across that chip’s mesh.
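To get a feel for distances on the grid, here is a rough sketch that counts hops between nodes as plain Manhattan distance. That is an assumption made for illustration only: it treats the NoC as a simple bidirectional mesh and says nothing about how the hardware actually routes, which direction each link runs, or which DRAM bank holds a given tile. The helper names are hypothetical, and coordinates are (column, row) as in the grid description earlier.

```python
Node = tuple[int, int]  # (column, row)

DRAM_ROWS = (0, 11)  # the two DRAM rows named in this section


def hops(src: Node, dst: Node) -> int:
    """Manhattan distance between two nodes (a proxy, not real routing)."""
    (src_col, src_row), (dst_col, dst_row) = src, dst
    return abs(src_col - dst_col) + abs(src_row - dst_row)


def nearest_dram_node(core: Node) -> Node:
    """DRAM-row node in the core's own column with the fewest hops to it."""
    col, _ = core
    candidates = [(col, row) for row in DRAM_ROWS]
    return min(candidates, key=lambda node: hops(node, core))


core = (3, 2)
print(nearest_dram_node(core))              # (3, 0)
print(hops(nearest_dram_node(core), core))  # 2
```

Even as a proxy, this shows why placement matters: a core in row 2 sits 2 hops from the top DRAM row but 9 from the bottom one, and traffic from many cores can share the same links.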

A Minimal TTNN Example

This is the entire open-device-matmul-close pattern, which you’ll recognize from every tutorial:

import ttnn, torch

# Open chip 0
device = ttnn.open_device(device_id=0)

# Move data onto the chip
a = ttnn.from_torch(torch.randn(64, 64), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.randn(64, 64), dtype=ttnn.bfloat16,
                    layout=ttnn.TILE_LAYOUT, device=device)

# Dispatch the matmul kernel — compiles JIT on first run
c = ttnn.matmul(a, b)

# Pull result back to CPU
result = ttnn.to_torch(c)
print(result.shape)

ttnn.close_device(device)

Nothing in this example is magic. Each step maps to a real chip operation: the from_torch calls dispatch DMA transfers through the PCIe column to DRAM; matmul dispatches reader/compute/writer kernels to a set of Tensix cores; to_torch moves the result tiles back through PCIe to host RAM.
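What "matmul on tiles" means can be sketched on the host in plain Python. This is a conceptual illustration, not how ttnn.matmul is implemented or scheduled: the function names are hypothetical, the tile edge is 2 instead of 32 to keep the example readable, and the dimensions are assumed to be exact multiples of the tile edge. Each output tile is built by accumulating tile-by-tile products, with the accumulator playing the role the DST registers play on the chip.

```python
Matrix = list[list[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain reference matrix multiply on nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def get_tile(m: Matrix, ti: int, tj: int, t: int) -> Matrix:
    """Extract the t x t tile at tile coordinates (ti, tj)."""
    return [row[tj * t:(tj + 1) * t] for row in m[ti * t:(ti + 1) * t]]


def tiled_matmul(a: Matrix, b: Matrix, t: int) -> Matrix:
    """Multiply a and b one output tile at a time."""
    n_i, n_k, n_j = len(a) // t, len(b) // t, len(b[0]) // t
    c = [[0] * (n_j * t) for _ in range(n_i * t)]
    for ti in range(n_i):
        for tj in range(n_j):
            acc = [[0] * t for _ in range(t)]  # accumulator for one output tile
            for tk in range(n_k):
                # One tile-by-tile product, added into the accumulator.
                partial = matmul(get_tile(a, ti, tk, t), get_tile(b, tk, tj, t))
                for r in range(t):
                    for col in range(t):
                        acc[r][col] += partial[r][col]
            # Write the finished tile into its place in the output.
            for r in range(t):
                c[ti * t + r][tj * t:(tj + 1) * t] = acc[r]
    return c


a = [[r * 4 + col for col in range(4)] for r in range(4)]
b = [[(r + col) % 3 for col in range(4)] for r in range(4)]
assert tiled_matmul(a, b, 2) == matmul(a, b)  # same result, tile by tile
```

For the 64×64 example above, the same loop structure with a tile edge of 32 would produce four output tiles, each the sum of two tile products.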
