What Comes Next

You unboxed a machine that most people have never touched. You confirmed four Blackhole chips were alive and talking to the system. You navigated Python environments that would trip up someone who wasn’t paying attention. You ran a model on accelerator hardware and watched tokens come back. That’s not a tutorial warmup — that’s the actual thing.

The rest is up to you.

Inference stack diagram showing the path from user interfaces through tt-inference-server and vLLM down to four Blackhole chips

Tools in Your World

The QB2 ships with a full stack, but the ecosystem is bigger. Start with tt-toplike — htop for your chips, except the telemetry comes alive as ASCII art:

tt-toplike insights mode — live ASCII visualization of all four Blackhole chips during inference — tt-toplike insights mode — all four Blackhole chips under live inference, power and DRAM state rendered in real time

GitHub ↗ tt-toplike Real-time hardware monitor — htop for your chips: temps, power, utilization, DRAM bandwidth, live in the terminal. sudo apt install tt-toplike GitHub ↗ tt-studio Web UI for model serving. Pick a model, click Run, get tokens — and as of v2.8.0 it can back Claude Code / OpenCode and generate video and images too. tt-studio → localhost:3000 Site ↗ tt-local-generator GTK4 desktop app for video, image, and art generation on QB2, on top of tt-inference-server. tt-local-generator GitHub ↗ tt-inference-server Docker-based one-command model deployment — the OpenAI-compatible server tt-studio and tt-local-generator route through. Site ↗ tt-vscode-toolkit VS Code extension with 40+ interactive lessons that run directly against your QB2. Site ↗ tt-awesome Community catalog of everything built on Tenstorrent hardware — models, demos, benchmarks, research.

Where to Go From Here

Pick a thing you want to do and jump straight in.

Lesson ↗ Production Inference with vLLM Serve a model behind an OpenAI-compatible API. 30 min Lesson ↗ TT-Inference-Server Run Llama-3.1-8B with one command. 20 min Lesson ↗ Interactive Chat Chat with an LLM directly in Python. 20 min Lesson → Running Llama-3.3-70B on QB2 Run the biggest model QB2 supports, across all four chips. 45 min Lesson → Claude Code on your QB2 New in tt-studio v2.8.0 — point Claude Code or OpenCode at a model running on your own chips. No cloud, no per-token bill. coding agents Lesson ↗ Local AI Agents on QB2 Run AI agents locally on a 70B model. 60 min Lesson ↗ QB2 Video Generation Generate video on your QB2. 45 min Lesson ↗ Explore TT-Metalium Build kernels from scratch on the Tensix cores. open-ended Lesson ↗ Cookbook Overview Write cookbook-style parallel algorithms. varies

Choose Your Next Track

Run & build →

Serve real models. Understand performance. Integrate with your existing ML workflow. If you're coming from CUDA, this is where the familiar parts live and where the new parts pay off.

Tinker →

Write code that runs on the chips directly — kernels, data movement, compute pipelines. The architecture goes all the way down and you can follow it.

Customize →

Customize, illuminate, break, and fix things. The LEDs, the desktop, the demos that make people stop and ask what that machine is.

The QB2 is a beginning. There’s a lot of surface area here, and you’ve only scratched it.

← Back to Explore