A full 70-billion parameter model, running locally on the four Blackhole chips in your TT-QuietBox® 2. No cloud. No API key. Just your hardware.
Llama-3.3-70B-Instruct from Meta. At 70 billion parameters, it is the largest Llama model that fits on a single QB2. Two years ago, a model of this size called for a dedicated cloud VM with 8× A100s. Your QB2 has four Blackhole chips (on two p300c cards); together they have enough DRAM bandwidth and capacity to run it.
The same command also runs these weight variants:
tt-inference-server identifies the QB2 as p300x2 — two p300c cards, four Blackhole chips. That’s the --tt-device value to pass for a model that needs the whole box, like this one.
Docker must be installed. The tt-inference-server uses Docker containers to manage the environment. If you’ve completed the Explore track, Docker is already present. Verify:
docker --version
# Docker version 24.x or later
HuggingFace token with Llama access. Meta’s Llama models require accepting a license agreement on HuggingFace and using a token. This is a one-time step.
export HF_TOKEN=hf_your_token_here
# Add to ~/.bashrc to persist across sessions:
echo 'export HF_TOKEN=hf_your_token_here' >> ~/.bashrc
Disk space. The model weights are approximately 140 GB. Docker volumes store them in /var/lib/docker/volumes/. Make sure you have that space available:
df -h /var/lib/docker
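The 140 GB figure follows from the parameter count if the checkpoint stores 16-bit weights (2 bytes per parameter), which is an assumption here. A minimal Python sketch of that arithmetic plus a free-space check; the function names and the 20% headroom are illustrative, not part of tt-inference-server:

```python
import shutil

def weights_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate checkpoint size in decimal GB (assumes 16-bit weights by default)."""
    return n_params * bytes_per_param / 1e9

def has_room(path: str, needed_gb: float, headroom: float = 1.2) -> bool:
    """True if the filesystem holding `path` has needed_gb plus headroom free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb * headroom

needed = weights_size_gb(70e9)  # 70 billion parameters x 2 bytes each
print(f"Estimated weights: {needed:.0f} GB")
# "/" is a stand-in so the sketch runs anywhere; use /var/lib/docker on the QB2.
print("Enough free space:", has_room("/", needed))
```

The headroom leaves space for the container image and compiled artifacts, whose sizes are not stated in this lesson.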
Hugepages. The Tenstorrent driver requires 1G hugepages. If you’ve run any model before, these are already configured. To verify:
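grep -i hugepages /proc/meminfo
# HugePages_Total should be > 0
# These counters cover only the default hugepage size (shown as Hugepagesize).
# 1G pages are also counted in /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages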
If hugepages are missing, the tt-installer script sets them up. See the install chapter.
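If you want the hugepage check in a script rather than by eye, the /proc/meminfo format is simple to parse. A small sketch, with a made-up sample so it runs on any machine; `hugepage_summary` is a hypothetical helper name:

```python
def hugepage_summary(meminfo_text: str) -> dict:
    """Pull the hugepage counters out of /proc/meminfo-style text."""
    wanted = {"HugePages_Total", "HugePages_Free", "Hugepagesize"}
    out = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted and rest.strip():
            out[key] = int(rest.split()[0])  # drop the trailing "kB" unit if present
    return out

# Sample text in the /proc/meminfo format; the numbers are made up for illustration.
sample = """MemTotal:       263856412 kB
HugePages_Total:       4
HugePages_Free:        4
Hugepagesize:    1048576 kB
"""
info = hugepage_summary(sample)
print(info)
print("1G pages configured:",
      info["HugePages_Total"] > 0 and info["Hugepagesize"] == 1048576)
```

On the QB2 itself, pass `open("/proc/meminfo").read()` instead of the sample. If the default hugepage size on your system is not 1 GB, this check reports False even when 1G pages exist, so read the per-size counter under /sys/kernel/mm/hugepages/ in that case.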
tt-inference-server is Tenstorrent’s Docker-based deployment tool. It wraps a TT-Metal-optimized fork of vLLM with one-command launch syntax.
git clone https://github.com/tenstorrent/tt-inference-server ~/code/tt-inference-server
cd ~/code/tt-inference-server
If you already have a clone, update it:
cd ~/code/tt-inference-server
git pull
The simplest path is the run.py helper from tt-inference-server. One command pulls the container, downloads the weights, compiles the model graphs, and maps the port:
cd ~/code/tt-inference-server
python3 run.py --model Llama-3.3-70B-Instruct --tt-device p300x2 --workflow server --docker-server
Under the hood, run.py launches the TT vLLM container. If you’d rather drive Docker yourself — to pin flags, or run without the repo — the equivalent is:
docker run \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc host \
--publish 8000:8000 \
--device /dev/tenstorrent \
--mount type=bind,src=/dev/hugepages-1G,dst=/dev/hugepages-1G \
--volume volume_id_Llama-3.3-70B-Instruct:/home/container_app_user/cache_root \
ghcr.io/tenstorrent/tt-inference-server/vllm-tt-metal-src-release-ubuntu-22.04-amd64:0.10.1-555f240-22be241 \
--model Llama-3.3-70B-Instruct \
--tt-device p300x2
What to watch for:
The container logs a lot during initialization. The meaningful signals:
# Docker image pulled and container starting
Starting vLLM server...
# Weights downloading (first run only)
Downloading shards: 100%|████████████████| 30/30
# Hardware initialization — all 4 chips should appear
Opening device 0... OK
Opening device 1... OK
Opening device 2... OK
Opening device 3... OK
# Op graph compilation — compiles Llama ops to Blackhole instructions
Compiling model graphs... (this takes 3-5 minutes)
# Ready
Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000
When you see Application startup complete, the server is accepting requests.
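Rather than watching the log, a script can poll the server until it answers. The sketch below separates the retry loop from the HTTP probe so the loop can be exercised without a server; the attempt count and delay are arbitrary choices:

```python
import time
import urllib.error
import urllib.request

def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or an HTTP error status

def wait_until_ready(probe, attempts: int = 120, delay_s: float = 5.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay_s)
    return False

# Dry run with a fake probe that succeeds on its third call.
calls = iter([False, False, True])
print(wait_until_ready(lambda: next(calls), attempts=5, delay_s=0))
```

Against the real server you might call `wait_until_ready(lambda: http_ok("http://localhost:8000/v1/models"))`. The `/v1/models` route is part of the OpenAI-compatible API that upstream vLLM serves; that the TT build exposes it the same way is an assumption.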
The server exposes an OpenAI-compatible API on port 8000. Test it with curl:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.3-70B-Instruct",
"messages": [
{
"role": "user",
"content": "Explain tensor parallelism in 3 sentences. Be specific about what moves across chip boundaries."
}
],
"max_tokens": 200
}' | python3 -m json.tool
Or pipe straight to the content:
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.3-70B-Instruct",
"messages": [{"role": "user", "content": "Write a haiku about Blackhole silicon."}]
}' | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['choices'][0]['message']['content'])"
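The same request can be made from Python with only the standard library, which is handy on a machine where you would rather not install anything. A minimal sketch mirroring the curl calls above; the helper names are illustrative:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 200) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def extract_content(response_json: dict) -> str:
    """Pull the assistant text out of a chat completions response."""
    return response_json["choices"][0]["message"]["content"]

req = build_chat_request("http://localhost:8000", "Llama-3.3-70B-Instruct",
                         "Write a haiku about Blackhole silicon.")
print(req.get_method(), req.full_url)

# Canned reply in the chat completions shape, so parsing can be shown without a server.
canned = {"choices": [{"message": {"role": "assistant", "content": "example text"}}]}
print(extract_content(canned))
```

With the server running, send it with `with urllib.request.urlopen(req) as r: print(extract_content(json.load(r)))`.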
The server speaks the same chat completions protocol as api.openai.com, so code written against the OpenAI SDK works once you point base_url at localhost:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-required", # server doesn't enforce auth
)
response = client.chat.completions.create(
model="Llama-3.3-70B-Instruct",
messages=[
{
"role": "system",
"content": "You are a concise technical assistant."
},
{
"role": "user",
"content": "What are the key differences between BF16 and FP16 for inference?"
}
],
max_tokens=300,
)
print(response.choices[0].message.content)
Install the SDK if needed:
pip install openai
Open a second terminal while inference is running. The difference between idle and active chips is visible in telemetry:
# Snapshot mode — JSON output, avoids TUI
tt-smi -s
Look for these fields across all four chips:
aiclk: AI clock frequency. Climbs from ~200 MHz at idle to 900–1000 MHz during prefill, settles during decode.
power: Power draw per chip. Expect 75–120 W per chip during active inference, ~15 W at idle.
temperature: ASIC die temperature. Normal operating range is 50–80°C. The chips have thermal throttling; they will clock down before reaching dangerous temperatures.
A simpler view while a request is processing:
watch -n 1 "tt-smi -s | python3 -c \"
import json, sys
data = json.load(sys.stdin)
for i, chip in enumerate(data.get('device_info', [])):
    print(f'Chip {i}: aiclk={chip.get(\\\"aiclk\\\", \\\"?\\\"):>6} MHz '
          f'power={chip.get(\\\"power\\\", \\\"?\\\"):>5} W '
          f'temp={chip.get(\\\"temperature\\\", \\\"?\\\"):>4}°C')
\""
During a long prompt (prefill phase), you’ll see aiclk spike across all four chips simultaneously — that’s tensor parallelism in action. All four chips are processing different attention heads in parallel. During decode (generating tokens one at a time), the pattern changes: aiclk is lower because decode is memory-bandwidth-bound, not compute-bound.
One Blackhole chip during Llama-3.3-70B prefill. All four of yours are doing this in parallel, each handling its own share of the attention heads.
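To make "different attention heads" concrete: the published Llama 3 70B configuration has 64 query heads and 8 key/value heads. The sketch below divides them evenly across four chips. How the TT-Metal backend actually shards the model is not described in this lesson, so read this as an illustration of tensor parallelism in general rather than of the real layout.

```python
def split_heads(n_heads: int, n_chips: int) -> list[range]:
    """Assign contiguous, equal-sized blocks of head indices to each chip."""
    if n_heads % n_chips:
        raise ValueError("heads must divide evenly across chips")
    per_chip = n_heads // n_chips
    return [range(c * per_chip, (c + 1) * per_chip) for c in range(n_chips)]

Q_HEADS, KV_HEADS, CHIPS = 64, 8, 4
layout = zip(split_heads(Q_HEADS, CHIPS), split_heads(KV_HEADS, CHIPS))
for chip, (q, kv) in enumerate(layout):
    print(f"chip {chip}: query heads {q.start}-{q.stop - 1}, "
          f"kv heads {kv.start}-{kv.stop - 1}")
```

Each chip ends up with 16 query heads and 2 key/value heads. Every chip works on every layer, and the partial results are combined across chip boundaries, which is why all four clocks rise together.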
The same infrastructure runs DeepSeek-R1-Distill-Llama-70B — a reasoning model. It uses the Llama-70B architecture but was fine-tuned to produce explicit chain-of-thought reasoning before giving an answer. The Docker command is identical except for the model name:
docker run \
--env "HF_TOKEN=$HF_TOKEN" \
--ipc host \
--publish 8000:8000 \
--device /dev/tenstorrent \
--mount type=bind,src=/dev/hugepages-1G,dst=/dev/hugepages-1G \
--volume volume_id_DeepSeek-R1-Distill-Llama-70B:/home/container_app_user/cache_root \
ghcr.io/tenstorrent/tt-inference-server/vllm-tt-metal-src-release-ubuntu-22.04-amd64:0.10.1-555f240-22be241 \
--model DeepSeek-R1-Distill-Llama-70B \
--tt-device p300x2
The HuggingFace model ID is deepseek-ai/DeepSeek-R1-Distill-Llama-70B. The repo is not gated, so there is no access request to make, but you still need an HF token.
The reasoning model produces output in a different format: it wraps its thinking in <think> tags before the final answer. A multi-step math problem or logic puzzle will show its full reasoning chain.
response = client.chat.completions.create(
model="DeepSeek-R1-Distill-Llama-70B",
messages=[{
"role": "user",
"content": "A train travels at 60 mph for 2 hours, then 90 mph for 1.5 hours. "
"What is the average speed for the entire trip?"
}],
max_tokens=600,
)
print(response.choices[0].message.content)
# Output starts with <think>...</think> showing the reasoning steps,
# then gives the final answer.
Reasoning models are worth trying on tasks where you want to see the model’s work: code debugging, multi-step math, logic puzzles, structured analysis. The <think> section is the model’s scratch pad — it often catches mistakes it would have made if it had answered directly.
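If your application only wants the final answer, the scratch pad can be separated from it after the fact. A minimal sketch, assuming the reply contains one well-formed <think>...</think> pair; `split_reasoning` is a hypothetical helper, not something the server provides:

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block is found."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

# Hand-written stand-in for a model reply, using the train problem above.
reply = ("<think>120 miles + 135 miles = 255 miles over 3.5 hours.</think>\n"
         "About 72.9 mph.")
reasoning, answer = split_reasoning(reply)
print("reasoning:", reasoning)
print("answer:", answer)
```

In practice you would pass `response.choices[0].message.content` in place of the stand-in string. If generation stops at max_tokens before the closing tag, the pattern will not match and the whole text comes back as the answer.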
Docker can’t find the hugepages mount:
Error response from daemon: invalid mount config for type "bind",
option "source" does not exist: /dev/hugepages-1G
Hugepages aren’t configured. Run the tt-installer script or configure them manually:
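# vm.nr_hugepages reserves pages of the default hugepage size. If Hugepagesize in
# /proc/meminfo is not 1048576 kB, reserve 1G pages through
# /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages instead.
echo 'vm.nr_hugepages = 32' | sudo tee /etc/sysctl.d/99-hugepages.conf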
sudo sysctl -p /etc/sysctl.d/99-hugepages.conf
sudo mkdir -p /dev/hugepages-1G
sudo mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages-1G
Container starts but model download fails:
huggingface_hub.errors.GatedRepoError: Access to model meta-llama/...
Your HF_TOKEN doesn’t have access to Llama models. Accept the license at huggingface.co/meta-llama/Llama-3.3-70B-Instruct while logged into the same account that generated the token.
Four chips appear in tt-smi but container only finds some:
Verify the driver exposes all devices:
ls /dev/tenstorrent/
# Should show: 0 1 2 3
If you only see some, the KMD may need a reload:
sudo rmmod tenstorrent
sudo modprobe tenstorrent
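The device check can also be scripted, for example as a preflight step before launching the container. A small sketch that reports which of the expected device nodes are absent; the helper name is illustrative, and the demo runs against a scratch directory rather than real hardware:

```python
import os
import tempfile

def missing_devices(dev_dir: str, expected: int = 4) -> list[str]:
    """Names from 0..expected-1 that are absent in dev_dir."""
    present = set(os.listdir(dev_dir)) if os.path.isdir(dev_dir) else set()
    return [str(i) for i in range(expected) if str(i) not in present]

# Demo on a scratch directory holding only nodes 0 and 1.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0", "1"):
        open(os.path.join(tmp, name), "w").close()
    print("missing:", missing_devices(tmp))
```

On the QB2, `missing_devices("/dev/tenstorrent")` should return an empty list.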
Server starts but requests return very slowly:
Confirm all four chips are active during inference using tt-smi -s. If only 1–2 show elevated aiclk, tensor parallelism isn’t using all four chips. Verify the --tt-device p300x2 flag is present in your command.
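Counting active chips by eye gets tedious, so here is a sketch that does it from a parsed snapshot. The field names (`device_info`, `aiclk`) mirror the one-liner earlier in this lesson and may differ between tt-smi versions; the 500 MHz threshold is an arbitrary midpoint between the idle and active figures quoted above.

```python
def active_chips(snapshot: dict, threshold_mhz: float = 500.0) -> list[int]:
    """Indices of chips whose aiclk is above threshold_mhz."""
    active = []
    for i, chip in enumerate(snapshot.get("device_info", [])):
        try:
            clk = float(chip.get("aiclk", 0))
        except (TypeError, ValueError):
            continue  # field missing or not numeric in this tt-smi version
        if clk > threshold_mhz:
            active.append(i)
    return active

# Made-up snapshot in which only two of four chips have ramped up.
sample = {"device_info": [{"aiclk": 950}, {"aiclk": 940}, {"aiclk": 200}, {"aiclk": 200}]}
busy = active_chips(sample)
print(f"{len(busy)}/4 chips active: {busy}")
```

Feed it the parsed output of `tt-smi -s` while a long prompt is in prefill; anything fewer than four active chips points back at the --tt-device flag.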
Out of disk space during Docker volume creation:
The default Docker data root is /var/lib/docker. If your root partition is small, move it:
# Check where docker stores data
docker info | grep "Docker Root Dir"
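# To move it, stop Docker and edit /etc/docker/daemon.json.
# Note: tee overwrites the file. If daemon.json already exists, add the
# data-root key by hand instead. Existing images and volumes are not moved.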
sudo systemctl stop docker
echo '{"data-root": "/your/larger/partition/docker"}' | sudo tee /etc/docker/daemon.json
sudo systemctl start docker
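If daemon.json already holds other settings, merging the key is safer than replacing the file. A minimal sketch of that merge, demonstrated on a scratch file; `set_data_root` is a hypothetical helper, and the real file at /etc/docker/daemon.json needs root to write:

```python
import json
import tempfile
from pathlib import Path

def set_data_root(config_path: Path, data_root: str) -> dict:
    """Set "data-root" in a Docker daemon.json, preserving any other keys."""
    config = {}
    if config_path.exists() and config_path.read_text().strip():
        config = json.loads(config_path.read_text())
    config["data-root"] = data_root
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config

# Demo against a scratch file that already contains one unrelated setting.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "daemon.json"
    path.write_text('{"log-driver": "json-file"}')
    print(set_data_root(path, "/your/larger/partition/docker"))
```

Stop Docker before changing the real file and start it again afterwards, as in the commands above.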
This is the largest model the QB2 runs with official Tenstorrent support. Models well beyond 70B need more memory or more chips than the QB2 has, which means an 8-chip system such as a Wormhole t3k or a Blackhole LoudBox (8× p150). The 70B range is the practical ceiling for a single QB2.
Inside that ceiling: Llama-3.3-70B-Instruct is the capable baseline. DeepSeek-R1-Distill-Llama-70B is the reasoning variant. The smaller models in other chapters (Llama-3.1-8B, Qwen3-0.6B) are faster to start and better for experimentation — use those for iteration, and come back here when you want to show someone what the machine can actually do.