Path

Run and build on top of models

vLLM, performance tuning, TT-Forge, and multi-chip inference.

This track is for ML engineers who already know PyTorch and CUDA. It covers the conceptual remapping from GPU compute to Tensix architecture, the model families the QB2 supports and how to acquire them, deploying models as a production OpenAI-compatible API with tt-inference-server (which wraps the TT vLLM fork), and reading hardware metrics to understand what the chips are actually doing. No kernel writing, no Metalium — those are in the Tinker track. This is the inference practitioner path: fast, production-oriented, grounded in the hardware.

Start Reading → Read it all

1 Coming From CUDA 8 min 2 The Model Zoo 6 min 3 Serving Models on QB2 10 min 4 Performance Tuning 8 min 5 Going Deeper 4 min 6 TT-Forge: Compile Anything 12 min