Your First Model

Everything up to now was preparation. This is the part where the machine does something interesting. Four chips, waiting. One small model, about to arrive.

Running Your First Model

⚡ Already loaded: your QB2 ships with Qwen3-32B pre-cached on disk. The no-download path to your first token is tt-studio — run tt-studio, pick Qwen3-32B from the Deploy Model dropdown, click Run. The first deploy takes a few minutes (no multi-GB download — the weights are already there). You enter a Hugging Face token once; the model is gated even though the weights are local.

This chapter takes the other path — the hands-on one, where you talk to a chip directly in Python and pull a tiny model down yourself. The starter is Qwen/Qwen3-0.6B — no license gate, 1.5 GB, runs on any Tenstorrent hardware.

First, activate the TTNN environment and verify the hardware is accessible:

source ~/tt-metal/python_env/bin/activate

Your prompt changes to show (python_env), and which python3 now points into the venv rather than /usr/bin/python3. Check it:

which python3
# → /home/yourname/tt-metal/python_env/bin/python3

Now do the handshake — open a device, confirm it responds, close it:

python3 -c "
import ttnn
device = ttnn.open_device(device_id=0)
print('Device open:', device)
ttnn.close_device(device)
print('Done.')
"

If you see Device open: without errors, chip 0 is alive and responding. Repeat with device_id=1, 2, 3 to verify all four.

⚠️ QB2 note: To work with all four chips together, use ttnn.CreateDevices({0, 1, 2, 3}) — not four separate open_device() calls. Opening and closing devices individually can cause dispatch core errors on multi-chip configs.

Download a model

Use the hf CLI (part of the huggingface_hub package already installed in the venv):

# hf — not huggingface-cli. The command is hf.
hf download Qwen/Qwen3-0.6B --local-dir ~/models/Qwen3-0.6B

This creates ~/models/Qwen3-0.6B/ with the HuggingFace-format weights (~1.5 GB). Check your disk first:

df -h ~

You need at least 3 GB free for this model alone. Larger models (Llama-3.1-8B) need 16+ GB.
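If you would rather script the disk check than read df output by eye, here is a minimal sketch using only the Python standard library. The helper names free_gib and has_room are made up for this example. The 3 and 16 thresholds are the figures quoted above, counted here in units of 1024^3 bytes to match what df -h reports.

```python
import shutil

GIB = 1024 ** 3  # df -h reports sizes in powers of 1024


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / GIB


def has_room(path, required_gib):
    """Return True when the filesystem holding `path` has `required_gib` free."""
    return free_gib(path) >= required_gib


# Example calls (hypothetical thresholds taken from the text above):
#   has_room(Path.home(), 3)    # Qwen3-0.6B
#   has_room(Path.home(), 16)   # Llama-3.1-8B
```

Run the check before hf download, so a full disk shows up as a clear False rather than a download that fails partway through.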

Figure: TTNN device open handshake on chip 0, then the Qwen3-0.6B files on disk.
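Once the download finishes, a quick sanity check on the model directory can catch an interrupted transfer. The sketch below assumes the usual Hugging Face layout, with a config.json and at least one .safetensors weight file. Those two names are an assumption about what the repo contains, not a complete manifest, and missing_model_files is a name invented for this example.

```python
from pathlib import Path


def missing_model_files(model_dir):
    """Return the expected entries that are absent from `model_dir`.

    Checks only two things (an assumption about a typical Hugging Face
    checkpoint): a config.json file and at least one *.safetensors file.
    """
    root = Path(model_dir).expanduser()
    missing = []
    if not (root / "config.json").is_file():
        missing.append("config.json")
    if not any(root.glob("*.safetensors")):
        missing.append("*.safetensors")
    return missing


# Example call: missing_model_files("~/models/Qwen3-0.6B")
# An empty list means both checks passed.
```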

What Just Happened

When that Python snippet ran without errors, the Blackhole chip opened a dispatch channel through the PCIe link, initialized its RISC-V cores, and confirmed it can receive work. Nothing computed yet. But the handshake — software to silicon — is the prerequisite for everything else.

⬡ Tensix Grid — Blackhole (P100/P150/P300c / QB2)

ttnn.open_device(0) — what happens inside the chip.

Serving a Model with vLLM

The fastest path to actually generating text is vLLM. It handles model loading, tokenization, batching, and presents an OpenAI-compatible HTTP API.

source ~/.tenstorrent-venv/bin/activate

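# Make sure the model is downloaded first (see above)
# Then start the server.
# --served-model-name sets the name clients pass as "model";
# without it, vLLM registers the full path given to --model.
python3 -m vllm.entrypoints.openai.api_server \
  --model ~/models/Qwen3-0.6B \
  --served-model-name Qwen3-0.6B \
  --port 8000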

You’ll see initialization messages as the model loads. This takes a minute or two on first run — the model weights are being compiled for the Blackhole architecture. Subsequent runs are faster.
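If a script needs to wait for the server rather than a person watching the log, one option is to poll an HTTP route until it answers. The sketch below assumes the server exposes the OpenAI-style /v1/models route on port 8000. The function names, the five-minute ceiling, and the two-second interval are all choices made for this example.

```python
import http.client
import time
import urllib.request


def server_is_up(url, timeout=2.0):
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all mean "not ready".
        return False


def wait_until_ready(url, max_wait=300.0, interval=2.0, probe=server_is_up):
    """Poll `url` until `probe` succeeds or `max_wait` seconds have passed."""
    deadline = time.monotonic() + max_wait
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example call (assumed route): wait_until_ready("http://localhost:8000/v1/models")
```

The probe parameter exists so the loop can be exercised without a running server. In normal use you leave it at its default.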

Once you see INFO: Application startup complete, the server is ready. In a new terminal:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What makes the Tenstorrent Blackhole chip different?"}]
  }' | python3 -m json.tool

The response is JSON. The answer is in choices[0].message.content.
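The same request can be made from Python with nothing but the standard library. This is a sketch of the request and response shape described above, not an official client. The function names are invented here, and the model argument must match whatever name the server registered for the model.

```python
import json
import urllib.request


def build_chat_payload(model, prompt):
    """Build the JSON body for a single-turn chat completion request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def extract_answer(response):
    """Pull the generated text out of a chat completion response dict."""
    return response["choices"][0]["message"]["content"]


def ask(prompt, model="Qwen3-0.6B", base_url="http://localhost:8000"):
    """POST one prompt to the server and return the text of the reply."""
    body = json.dumps(build_chat_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_answer(json.load(resp))


# Example call, with the server from above running:
#   print(ask("What makes the Tenstorrent Blackhole chip different?"))
```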

💡 Why Qwen3-0.6B? It's the recommended starter model for all Tenstorrent hardware: small enough to load fast (~1.5 GB), capable enough to give real answers, and free of any Hugging Face license gate. It is also reasoning-capable, with two thinking modes: to skip extended reasoning, add "chat_template_kwargs": {"enable_thinking": false} to the request. Start here before trying larger models.

Using tt-studio (the Web UI)


tt-studio is a web interface for running models on QB2 without writing a line of code. It handles model selection, container lifecycle, and inference end-to-end — open a browser, pick a model, get tokens back. It’s the lowest-effort path to your first token on a QB2.

Start it with the pre-installed wrapper command:

tt-studio

Then open http://localhost:3000 in your browser, pick a model from the Deploy Model dropdown, and click Run. On a QB2, Qwen3-32B is already there with its weights pre-cached — its first deploy skips the multi-GB download and is ready in a few minutes. Other models download on first use; after that, every run loads fast from the on-disk cache. (tt-studio v2.8.0 also fixed the cold first-chat delay after an idle model, so that first token comes back quickly.)

ℹ What the wrapper does: tt-studio is a convenience command that ships with the QB2. Under the hood it launches the same stack you'd get by cloning the repo and running python3 run.py: that script sets up the submodule and .env, prompts for your Hugging Face token, selects the right Docker overlays for your hardware, brings up the Django + React app plus the model containers, and serves the UI at localhost:3000. On any other machine, that clone-and-run.py flow is how you'd start it.

What’s happening under the hood: tt-studio is a UI sitting on top of tt-inference-server. When you select a model and click Run, tt-studio spins up a Docker container running the TT fork of vLLM on port 8000. Your browser talks to tt-studio; tt-studio talks to that container. tt-local-generator routes through the same container — both are UIs sitting on top of tt-inference-server, just with different front ends.
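Since the stack is two listeners, the UI on port 3000 and the model container on port 8000, a quick way to see which half is up is to try a TCP connection to each. The sketch below does only that: it confirms something is listening, not that the service behind it is healthy. The function names and the labels in the default ports tuple are invented for this example.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_status(host="localhost",
                 ports=(("tt-studio UI", 3000), ("model container", 8000))):
    """Map each labeled port to whether something is listening on it."""
    return {name: port_open(host, port) for name, port in ports}


# Example call: print(stack_status())
```

If the UI port answers but the container port does not, the model is probably still deploying or was never started.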

To access tt-studio from your laptop while the QB2 is on your network, forward the port over SSH:

ssh -L 3000:localhost:3000 user@qb2-hostname

Then open http://localhost:3000 on your local machine as if you were sitting in front of the QB2.

For a deeper look at how the inference server is wired up, the tt-vscode-toolkit lesson on tt-inference-server walks through the architecture interactively — Docker flags, model download, port mapping, and what logs to watch on first boot.

ℹ Two UIs, one server: tt-studio and tt-local-generator are both front ends for tt-inference-server. You can switch between them freely — they talk to the same running container on port 8000.

🤖 New in v2.8.0 — your QB2 as a coding backend: tt-studio can now serve a deployed model to Claude Code and OpenCode through a built-in gateway, so a coding agent runs against your own chips instead of a cloud API. It also added text-to-video (WAN) and image (Flux) generation. See Serving Models on QB2 for the coding-agent setup.

Figure: tt-studio is a single command that starts a web UI at localhost:3000, accessible via SSH tunnel from your laptop (shown: tt-studio on PATH, the startup command, SSH port-forward instructions, and --help output).

Multi-Device: Using All Four Chips

To spread a model across all four Blackhole chips, use CreateDevices instead of open_device:

source ~/tt-metal/python_env/bin/activate

python3 -c "
import ttnn
devices = ttnn.CreateDevices({0, 1, 2, 3})
print('All devices:', devices)
ttnn.CloseDevices(devices)
print('Done.')
"

CreateDevices handles the mesh configuration that lets the chips coordinate. Models loaded this way can distribute layers across chips, increasing the effective memory pool and throughput. Large models (Llama-3.1-70B) require this because they don't fit in a single chip's memory.
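In a longer script, the work between CreateDevices and CloseDevices can raise, and an exception that skips the close leaves the chips held open. One way to guard against that is a small context manager, sketched below. The device_mesh name is invented, and passing the ttnn module in as an argument is a design choice of this example, made so the wrapper can be exercised without hardware. It calls only CreateDevices and CloseDevices, the two functions used above.

```python
from contextlib import contextmanager


@contextmanager
def device_mesh(ttnn_module, device_ids):
    """Open several devices together and always close them on exit.

    `ttnn_module` is the imported ttnn module; `device_ids` is passed
    through to CreateDevices unchanged.
    """
    devices = ttnn_module.CreateDevices(device_ids)
    try:
        yield devices
    finally:
        # Runs on normal exit and on exceptions alike.
        ttnn_module.CloseDevices(devices)


# Example use on a QB2 (requires the ttnn environment to be active):
#   import ttnn
#   with device_mesh(ttnn, {0, 1, 2, 3}) as devices:
#       print('All devices:', devices)
```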

⬡ One mesh, four chips — what CreateDevices opens

CreateDevices spans all four chips: a large model's layers spread across them for more memory and throughput. (A small model like Qwen3-0.6B runs happily on one chip.)

Figure: Opening a TTNN device and browsing the Qwen3-0.6B model files on a live QB2.
