
Running Llama-3.3-70B on QB2

A full 70-billion-parameter model, running locally on the four Blackhole chips in your TT-QuietBox® 2. No cloud. No API key. Just your hardware.


Time: ~45 min (mostly waiting on model download)
Prerequisites: Docker installed, HuggingFace account

What you’re actually deploying

Llama-3.3-70B-Instruct from Meta. 70 billion parameters — the largest Llama model that fits on a single QB2. This is the model that, two years ago, required a dedicated cloud VM with 8× A100s. Your QB2 has four Blackhole chips (on two p300c cards); together they have enough DRAM bandwidth and capacity to run it.

The same command also runs these weight variants:

Llama-3.1-70B-Instruct — slightly older, same architecture
DeepSeek-R1-Distill-Llama-70B — a reasoning model distilled from DeepSeek-R1 into the Llama-70B architecture. Swap the model name and you get chain-of-thought reasoning output from the same server.

Model: Llama-3.3-70B-Instruct (meta-llama/Llama-3.3-70B-Instruct)
Status: 🟡 Functional, tested on QB2 (p300x2)
Max context: 131,072 tokens (128K context window)
Max batch size (concurrent requests): [value missing]

tt-inference-server identifies the QB2 as p300x2 — two p300c cards, four Blackhole chips. That’s the --tt-device value to pass for a model that needs the whole box, like this one.

Before you start

Docker must be installed. The tt-inference-server uses Docker containers to manage the environment. If you’ve completed the Explore track, Docker is already present. Verify:

docker --version
# Docker version 24.x or later

HuggingFace token with Llama access. Meta’s Llama models require accepting a license agreement on HuggingFace and using a token. This is a one-time step.

Go to huggingface.co/meta-llama/Llama-3.3-70B-Instruct
Log in and accept the license
Create a read token at huggingface.co/settings/tokens
Set it in your environment:

export HF_TOKEN=hf_your_token_here
# Add to ~/.bashrc to persist across sessions:
echo 'export HF_TOKEN=hf_your_token_here' >> ~/.bashrc
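A format check before launching can catch an unset or mangled token early. The sketch below only verifies that HF_TOKEN is set and carries the hf_ prefix shown above. It does not contact HuggingFace, so it cannot tell you whether the license was accepted. The function name is illustrative.

```python
import os


def token_problem(token):
    """Return a description of what looks wrong with the token, or None if it looks plausible."""
    if not token:
        return "HF_TOKEN is not set"
    if token != token.strip():
        return "HF_TOKEN has leading or trailing whitespace"
    if not token.startswith("hf_"):
        return "HF_TOKEN does not start with 'hf_'"
    return None


if __name__ == "__main__":
    problem = token_problem(os.environ.get("HF_TOKEN"))
    print(problem or "HF_TOKEN looks plausible (format check only)")
```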

Disk space. The model weights are approximately 140 GB. Docker volumes store them in /var/lib/docker/volumes/. Make sure you have that space available:

df -h /var/lib/docker
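The 140 GB figure is what you get from 70 billion parameters at 2 bytes each, assuming the checkpoint stores 16-bit weights. A minimal sketch of that arithmetic plus a free-space check follows. The 15 GB allowance is the container image size quoted later in this lesson, and the helper names are illustrative.

```python
import os
import shutil

PARAMS = 70e9         # 70 billion parameters
BYTES_PER_PARAM = 2   # assumption: 16-bit (BF16) weights
IMAGE_GB = 15         # container image size quoted in this lesson


def weights_gb(params=PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Approximate checkpoint size in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def free_gb(path):
    """Free space on the filesystem holding `path`, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_room(free, needed):
    """True when the free space covers the requirement."""
    return free >= needed


if __name__ == "__main__":
    needed = weights_gb() + IMAGE_GB
    # Fall back to the root filesystem if Docker's directory does not exist yet.
    path = "/var/lib/docker" if os.path.exists("/var/lib/docker") else "/"
    free = free_gb(path)
    print(f"need ~{needed:.0f} GB, {free:.0f} GB free on {path}: ok={has_room(free, needed)}")
```

Leave some headroom beyond the raw total. Compiled model caches also take space, and this sketch does not estimate them.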

Hugepages. The Tenstorrent driver requires 1G hugepages. If you’ve run any model before, these are already configured. To verify:

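grep Huge /proc/meminfo
# HugePages_Total counts pages of the default size (Hugepagesize) only.
# For 1G pages specifically, this count should be > 0:
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages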

If hugepages are missing, the tt-installer script sets them up. See the install chapter.

Step 1 — Pull tt-inference-server

tt-inference-server is Tenstorrent’s Docker-based deployment tool. It wraps a TT-Metal-optimized fork of vLLM with one-command launch syntax.

git clone https://github.com/tenstorrent/tt-inference-server ~/code/tt-inference-server
cd ~/code/tt-inference-server

If you already have a clone, update it:

cd ~/code/tt-inference-server
git pull

Step 2 — Start the server

The simplest path is the run.py helper from tt-inference-server. One command pulls the container, downloads the weights, compiles the model graphs, and maps the port:

cd ~/code/tt-inference-server
python3 run.py --model Llama-3.3-70B-Instruct --tt-device p300x2 --workflow server --docker-server

Under the hood, run.py launches the TT vLLM container. If you’d rather drive Docker yourself — to pin flags, or run without the repo — the equivalent is:

docker run \
  --env "HF_TOKEN=$HF_TOKEN" \
  --ipc host \
  --publish 8000:8000 \
  --device /dev/tenstorrent \
  --mount type=bind,src=/dev/hugepages-1G,dst=/dev/hugepages-1G \
  --volume volume_id_Llama-3.3-70B-Instruct:/home/container_app_user/cache_root \
  ghcr.io/tenstorrent/tt-inference-server/vllm-tt-metal-src-release-ubuntu-22.04-amd64:0.10.1-555f240-22be241 \
  --model Llama-3.3-70B-Instruct \
  --tt-device p300x2

First run takes a long time. Docker will pull the container image (~15 GB), then download the model weights from HuggingFace (~140 GB). On a 500 Mbps connection, expect 40–60 minutes total. Subsequent starts use the cached Docker volume and take about 3–5 minutes.
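The 40 to 60 minute estimate can be reproduced from the sizes above. This sketch assumes the link runs at its full rated speed, which real downloads rarely sustain, so treat the result as a lower bound.

```python
def download_minutes(size_gb, link_mbps):
    """Minutes to transfer size_gb (decimal GB) over a link of link_mbps (megabits per second)."""
    bits = size_gb * 1e9 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 60


if __name__ == "__main__":
    total_gb = 15 + 140  # container image + model weights
    for mbps in (100, 500, 1000):
        print(f"{mbps:>5} Mbps: {download_minutes(total_gb, mbps):6.1f} min")
```

At 500 Mbps the ideal figure is a little over 41 minutes, which matches the low end of the range quoted above.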

What to watch for:

The container logs a lot during initialization. The meaningful signals:

# Docker image pulled and container starting
Starting vLLM server...

# Weights downloading (first run only)
Downloading shards: 100%|████████████████| 30/30

# Hardware initialization — all 4 chips should appear
Opening device 0... OK
Opening device 1... OK
Opening device 2... OK
Opening device 3... OK

# Op graph compilation — compiles Llama ops to Blackhole instructions
Compiling model graphs... (this takes 3-5 minutes)

# Ready
Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000

When you see Application startup complete, the server is accepting requests.
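If you are scripting around the server, polling is easier than watching logs. The sketch below assumes this container exposes the /v1/models route that upstream vLLM's OpenAI-compatible server provides, and that no auth header is needed. The function names and the one-hour default timeout are illustrative.

```python
import http.client
import time
import urllib.request


def http_probe(url="http://localhost:8000/v1/models", timeout_s=5):
    """Return True if the server answers the request with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or HTTP error: not ready yet.
        return False


def wait_until_ready(probe=http_probe, timeout_s=3600, interval_s=10, sleep=time.sleep):
    """Call probe() until it returns True or timeout_s of sleeping has elapsed."""
    waited = 0
    while True:
        if probe():
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

Call wait_until_ready() after launching the container and send requests only once it returns True.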

Step 3 — Send a request

The server exposes an OpenAI-compatible API on port 8000. Test it with curl:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.3-70B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Explain tensor parallelism in 3 sentences. Be specific about what moves across chip boundaries."
      }
    ],
    "max_tokens": 200
  }' | python3 -m json.tool

Or pipe straight to the content:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Blackhole silicon."}]
  }' | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['choices'][0]['message']['content'])"

Step 4 — Use it from Python

The server speaks the OpenAI chat completions API, so code written against the OpenAI SDK works with two changes: point base_url at localhost and use the local model name.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",  # server doesn't enforce auth
)

response = client.chat.completions.create(
    model="Llama-3.3-70B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a concise technical assistant."
        },
        {
            "role": "user",
            "content": "What are the key differences between BF16 and FP16 for inference?"
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)

Install the SDK if needed:

pip install openai
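For interactive use you may prefer to print tokens as they arrive. The sketch below uses the OpenAI SDK's standard stream=True option and assumes this server supports streaming the way upstream vLLM does. iter_text and stream_reply are illustrative names.

```python
def iter_text(chunks):
    """Yield the non-empty text pieces from a stream of chat completion chunks."""
    for chunk in chunks:
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            yield piece


def stream_reply(prompt, model="Llama-3.3-70B-Instruct"):
    """Print a reply as it is generated; requires the server from Step 2 to be running."""
    from openai import OpenAI  # imported here so iter_text works without the SDK installed

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-required")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
        stream=True,
    )
    for piece in iter_text(stream):
        print(piece, end="", flush=True)
    print()
```

With the server up, stream_reply("Summarize tensor parallelism in two sentences.") prints the answer incrementally instead of after the full generation.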

Step 5 — Watch the hardware work

Open a second terminal while inference is running. The difference between idle and active chips is visible in telemetry:

# Snapshot mode — JSON output, avoids TUI
tt-smi -s

Look for these fields across all four chips:

aiclk — AI clock frequency. Climbs from ~200 MHz at idle to 900–1000 MHz during prefill, settles during decode.
power — Power draw per chip. Expect 75–120W per chip during active inference, ~15W at idle.
temperature — ASIC die temperature. Normal operating range is 50–80°C. The chips have thermal throttling; they will clock down before reaching dangerous temperatures.

A simpler view while a request is processing:

watch -n 1 "tt-smi -s | python3 -c \"
import json, sys
data = json.load(sys.stdin)
for i, chip in enumerate(data.get('device_info', [])):
    print(f'Chip {i}: aiclk={chip.get(\\\"aiclk\\\", \\\"?\\\"):>6} MHz  '
          f'power={chip.get(\\\"power\\\", \\\"?\\\"):>5} W  '
          f'temp={chip.get(\\\"temperature\\\", \\\"?\\\"):>4}°C')
\""

During a long prompt (prefill phase), you’ll see aiclk spike across all four chips simultaneously — that’s tensor parallelism in action. All four chips are processing different attention heads in parallel. During decode (generating tokens one at a time), the pattern changes: aiclk is lower because decode is memory-bandwidth-bound, not compute-bound.
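If the watch one-liner above prints "?" for every field, the snapshot layout on your tt-smi version probably differs from what it expects. The sketch below is a more tolerant parser under two assumptions that you should check against your own tt-smi -s output: the readings may be nested under a telemetry object inside each device_info entry, and the temperature key may be asic_temperature rather than temperature.

```python
import json
import sys


def chip_rows(snapshot):
    """Extract (aiclk, power, temperature) per chip from a tt-smi snapshot dict."""
    rows = []
    for chip in snapshot.get("device_info", []):
        # Fall back to the chip entry itself when there is no nested telemetry object.
        telem = chip.get("telemetry", chip)
        temp = telem.get("asic_temperature", telem.get("temperature", "?"))
        rows.append((telem.get("aiclk", "?"), telem.get("power", "?"), temp))
    return rows


def format_rows(rows):
    """Render one aligned line per chip; values may be numbers or padded strings."""
    return [
        f"Chip {i}: aiclk={str(clk).strip():>6} MHz  "
        f"power={str(pw).strip():>5} W  temp={str(t).strip():>4}°C"
        for i, (clk, pw, t) in enumerate(rows)
    ]


def main():
    """Read a snapshot from stdin and print the per-chip summary."""
    print("\n".join(format_rows(chip_rows(json.load(sys.stdin)))))
```

Saved as a script with a call to main() at the bottom, this can be fed from tt-smi -s through a pipe. During prefill all four lines should show a raised aiclk.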

⬡ Tensix Grid — Blackhole (P100/P150/P300c / QB2)

One Blackhole chip during Llama-3.3-70B prefill. All four of yours are doing this in parallel, each handling its own slice of every layer.

Variant: DeepSeek-R1-Distill-Llama-70B

The same infrastructure runs DeepSeek-R1-Distill-Llama-70B — a reasoning model. It uses the Llama-70B architecture but was fine-tuned to produce explicit chain-of-thought reasoning before giving an answer. The Docker command is identical except for the model name:

docker run \
  --env "HF_TOKEN=$HF_TOKEN" \
  --ipc host \
  --publish 8000:8000 \
  --device /dev/tenstorrent \
  --mount type=bind,src=/dev/hugepages-1G,dst=/dev/hugepages-1G \
  --volume volume_id_DeepSeek-R1-Distill-Llama-70B:/home/container_app_user/cache_root \
  ghcr.io/tenstorrent/tt-inference-server/vllm-tt-metal-src-release-ubuntu-22.04-amd64:0.10.1-555f240-22be241 \
  --model DeepSeek-R1-Distill-Llama-70B \
  --tt-device p300x2

The HuggingFace model ID is deepseek-ai/DeepSeek-R1-Distill-Llama-70B. The repo is not gated, so there is no access request to file. You do still need an HF token.

The reasoning model produces output in a different format: it wraps its thinking in <think> tags before the final answer. A multi-step math problem or logic puzzle will show its full reasoning chain.

response = client.chat.completions.create(
    model="DeepSeek-R1-Distill-Llama-70B",
    messages=[{
        "role": "user",
        "content": "A train travels at 60 mph for 2 hours, then 90 mph for 1.5 hours. "
                   "What is the average speed for the entire trip?"
    }],
    max_tokens=600,
)
print(response.choices[0].message.content)
# Output starts with <think>...</think> showing the reasoning steps,
# then gives the final answer.

Reasoning models are worth trying on tasks where you want to see the model’s work: code debugging, multi-step math, logic puzzles, structured analysis. The <think> section is the model’s scratch pad — it often catches mistakes it would have made if it had answered directly.
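When you only want the final answer, you can separate the scratch pad from the rest. This is a minimal sketch that splits on the closing </think> tag, so it also copes with output where the opening tag is missing (an assumption about how some chat templates behave). It also includes the reference answer for the train prompt above, so you can check the model's work.

```python
def split_reasoning(text):
    """Split model output into (reasoning, answer) around the closing </think> tag."""
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag: treat the whole output as the answer.
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Reference answer for the train prompt: total distance over total time.
distance_miles = 60 * 2 + 90 * 1.5    # 120 + 135 = 255
hours = 2 + 1.5                       # 3.5
average_mph = distance_miles / hours  # about 72.9, not the naive (60 + 90) / 2 = 75
```

If max_tokens cuts generation off before the closing tag, the whole output comes back as the answer. Raise max_tokens when that happens.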

Troubleshooting

Docker can’t find the hugepages mount:

Error response from daemon: invalid mount config for type "bind",
option "source" does not exist: /dev/hugepages-1G

Hugepages aren’t configured. Run the tt-installer script or configure them manually:

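# vm.nr_hugepages sets the count for the default hugepage size only (often 2 MB);
# check yours with: grep Hugepagesize /proc/meminfo
echo 'vm.nr_hugepages = 32' | sudo tee /etc/sysctl.d/99-hugepages.conf
sudo sysctl -p /etc/sysctl.d/99-hugepages.conf
# The 1G page count lives in sysfs and should be > 0 before mounting:
# cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
sudo mkdir -p /dev/hugepages-1G
sudo mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages-1G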

Container starts but model download fails:

huggingface_hub.errors.GatedRepoError: Access to model meta-llama/...

Your HF_TOKEN doesn’t have access to Llama models. Accept the license at huggingface.co/meta-llama/Llama-3.3-70B-Instruct while logged into the same account that generated the token.

Four chips appear in tt-smi but container only finds some:

Verify the driver exposes all devices:

ls /dev/tenstorrent/
# Should show: 0  1  2  3

If you only see some, the KMD may need a reload:

sudo rmmod tenstorrent
sudo modprobe tenstorrent
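The same check is easy to script, for example before launching the container from automation. The sketch assumes the numbered device nodes shown above (0 through 3 under /dev/tenstorrent). The helper names are illustrative.

```python
import os


def missing_device_ids(entries, expected=4):
    """Return the device indices in range(expected) that are absent from `entries`."""
    present = {int(name) for name in entries if name.isdecimal()}
    return [i for i in range(expected) if i not in present]


def list_device_entries(path="/dev/tenstorrent"):
    """List entries under the driver's device directory; empty if it cannot be read."""
    try:
        return os.listdir(path)
    except OSError:
        return []


if __name__ == "__main__":
    missing = missing_device_ids(list_device_entries())
    if missing:
        print(f"missing device nodes: {missing}")
    else:
        print("all four device nodes present")
```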

Server starts but requests return very slowly:

Confirm all four chips are active during inference using tt-smi -s. If only 1–2 show elevated aiclk, tensor parallelism isn’t using all four chips. Verify the --tt-device p300x2 flag is present in your command.

Out of disk space during Docker volume creation:

The default Docker data root is /var/lib/docker. If your root partition is small, move it:

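# Check where docker stores data
docker info | grep "Docker Root Dir"

# To move it, stop Docker and set data-root in /etc/docker/daemon.json.
# Caution: tee overwrites the file; if daemon.json already exists, add the key by hand instead.
# Existing images and volumes stay in the old directory unless you copy them over.
sudo systemctl stop docker
echo '{"data-root": "/your/larger/partition/docker"}' | sudo tee /etc/docker/daemon.json
sudo systemctl start docker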
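The echo into tee above replaces the whole file, which drops any settings already in daemon.json. A minimal sketch of a merge that preserves existing keys follows. The function name is illustrative, and writing the result back to /etc/docker/daemon.json (with sudo) is left to you.

```python
import json


def with_data_root(existing_text, new_root):
    """Return daemon.json text with data-root set, keeping any other keys intact."""
    config = json.loads(existing_text) if existing_text.strip() else {}
    if not isinstance(config, dict):
        raise ValueError("daemon.json must contain a JSON object")
    config["data-root"] = new_root
    return json.dumps(config, indent=2) + "\n"
```

Pass in the current file contents (or an empty string if the file does not exist) and write the returned text while Docker is stopped.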

Where this fits

This is the largest model the QB2 runs with official Tenstorrent support. Models well beyond 70B need more memory or more chips than the QB2 has, such as an 8-chip system like a Wormhole t3k or a Blackhole LoudBox (8× p150). The 70B range is the practical ceiling for a single QB2.

Inside that ceiling: Llama-3.3-70B-Instruct is the capable baseline. DeepSeek-R1-Distill-Llama-70B is the reasoning variant. The smaller models in other chapters (Llama-3.1-8B, Qwen3-0.6B) are faster to start and better for experimentation — use those for iteration, and come back here when you want to show someone what the machine can actually do.
