Going Deeper

You’ve rerouted the mental model, picked a model that fits the hardware, stood up a production inference server, and watched the hardware breathe through prefill and decode. That’s the Run & build track done. What it opens up is considerably larger.

Interactive Lessons in tt-vscode-toolkit

The VS Code extension ships lessons that run against your QB2 directly — not simulated, not mocked. Real inference, real hardware feedback, real timing numbers. Each lesson is a structured walkthrough with code cells you execute against the machine.

Production Inference with vLLM 30 min

Multi-user load testing, request queuing, continuous batching mechanics, latency vs. throughput tradeoff measurement on live hardware.

TT-Inference-Server 20 min

Docker-based one-command deploy. Model switching. Container lifecycle management. The path from development to something you'd actually run in production.

Explore TT-Metalium open-ended

The layer below TTNN. How Metalium kernels are written, compiled, and dispatched. How NoC routing works in practice. How the tensor parallel AllReduce crosses chip boundaries without touching the host CPU.

Cookbook Overview varies

Parallel algorithm patterns for Tensix. Matrix multiply, convolution, attention, and more — written at the TTNN level with performance notes for Blackhole.

Three Things to Try Next

Run Llama-3.3-70B with all four chips. The largest model QB2 officially supports: 70 billion parameters, 128K context, tensor-parallel across all four Blackhole chips. The lesson has the exact Docker command, prerequisites checklist, and a variant for the DeepSeek-R1 reasoning model that uses the same infrastructure. Download the weights (140 GB — plan ahead), start the server, and run a request that would be genuinely difficult to answer. Watch tt-smi -s while it generates — the hardware doing real work looks different from the hardware doing toy work.

Build a Python application against the OpenAI-compatible API. The server is running on localhost:8000. The OpenAI SDK works unchanged. Take something you’ve built against api.openai.com — a chatbot, a summarizer, a classification pipeline — and point it at your QB2. Measure the latency. Compare the cost per token. This is where the practical value of local inference becomes tangible rather than theoretical.

Take the Tinker track. The Run & build track ends at the TTNN surface. The Tinker track goes below it: Metalium kernels, NoC data movement, dispatch programming, the full architecture exposure. If you’ve ever wanted to understand how a matmul actually runs on silicon — not the math, the execution — that track is the path.

Lesson → Running Llama-3.3-70B on QB2 The largest model QB2 officially supports, tensor-parallel across all four chips — exact Docker command, prerequisites, and a DeepSeek-R1 variant. all four chips Lesson ↗ Explore TT-Metalium The layer below TTNN — how Metalium kernels are written, compiled, and dispatched, and how tensor-parallel AllReduce crosses chip boundaries. open-ended

Community and Further Reading

tt-toplike docs

Full reference for every mode and metric. Understand what the numbers mean and what actions they suggest.

tt-awesome

Community catalog of everything built on Tenstorrent hardware. Models, benchmarks, integrations, demos. If someone has run it on a Blackhole, it shows up here.

Choose Your Next Track

Tinker →

Write code that runs directly on the Tensix cores. Metalium kernels, NoC data movement, compute pipelines from scratch. The architecture goes all the way down — this track follows it.

Customize →

Customize, illuminate, and demo the machine. The LEDs, the desktop setup, the demos that make people stop and ask what that thing is running.

You ran serious inference on serious hardware and you understand why it works the way it does. That’s a meaningful thing to know. The QB2 is a beginning, and you’ve got your bearings.

← Performance Tuning | TT-Forge: Compile Anything →