N150 N300 T3K P100 P300C 30 min Validated

Production Inference with vLLM

⚠️ Note: vLLM requires the HuggingFace model format. If you downloaded the model in Lesson 3 before this update, you may need to re-download to get both Meta and HuggingFace formats. The latest Lesson 3 downloads the complete model with all formats.

Take your AI deployment to the next level with vLLM - a production-grade inference engine that provides OpenAI-compatible APIs, continuous batching, and enterprise features for Tenstorrent hardware.

What is vLLM?

vLLM is an open-source LLM serving library designed for high-throughput, low-latency inference. Tenstorrent maintains a fork that brings vLLM's advanced features to Tenstorrent hardware.

Why vLLM?

🚀 OpenAI-compatible API - drop-in replacement for OpenAI's API
⚡ Continuous batching - efficiently serve multiple users simultaneously
📊 Production-tested - used by companies at scale
🔧 Advanced features - request queuing, priority scheduling, streaming
🎯 Easy deployment - standardized server interface

Journey So Far

Lesson 3: One-shot inference demo
Lesson 4: Interactive chat (custom app, model in memory)
Lesson 5: Flask HTTP API (basic server)
Lesson 6: vLLM (production-grade serving) ← You are here

vLLM vs. Your Flask Server

Feature	Flask (Lesson 5)	vLLM (Lesson 6)
Model Loading	Manual	Automatic
API Compatibility	Custom	OpenAI-compatible
Multiple Users	Sequential	Continuous batching
Request Queuing	Manual	Built-in
Streaming	Manual	Built-in
Production-Ready	Basic	Enterprise-grade
Learning Curve	Easy	Moderate

When to use what:

Flask (Lesson 5): Learning, prototyping, simple use cases
vLLM (Lesson 6): Production, multiple users, scalability

Architecture

graph TB
    Clients[OpenAI SDK / curl / Apps]

    subgraph vLLM["vLLM Server"]
        API[OpenAI-Compatible API]
        Batch[Continuous Batch Engine]
        Backend[TT-Metal Backend]

        API --> Batch
        Batch --> Backend
    end

    Hardware[Tenstorrent Hardware]

    Clients <--> API
    Backend --> Hardware

    style Clients fill:#5347a4,stroke:#fff,color:#fff
    style API fill:#3293b2,stroke:#fff,color:#fff
    style Batch fill:#499c8d,stroke:#fff,color:#fff
    style Backend fill:#499c8d,stroke:#fff,color:#fff
    style Hardware fill:#ffb71b,stroke:#000,color:#000

Prerequisites

tt-metal installed and working (latest main branch - see Step 0 below if you need to update)
Model downloaded (Llama-3.1-8B-Instruct)
Python 3.10+ recommended
~20GB disk space for vLLM installation

Starting Fresh?

If you're jumping directly to this lesson, verify your setup first:

Quick prerequisite checks:

# Hardware detected?
tt-smi

# tt-metal working?
python3 -c "import ttnn; print('✓ tt-metal ready')"

# Model downloaded?
ls ~/models/Llama-3.1-8B-Instruct/config.json

# Python version?
python3 --version  # Need 3.10+

If any checks fail:

No hardware? → See Hardware Detection
No tt-metal? → See Verify Installation

No model? → See Download Model or download now:

hf download meta-llama/Llama-3.1-8B-Instruct \
  --local-dir ~/models/Llama-3.1-8B-Instruct

The Perfect Starting Model: Qwen3-0.6B

Why start with Qwen3-0.6B?

You don't need 8B parameters for production AI. Qwen3-0.6B is a game-changer for development and many production use cases:

🚀 Key Strengths:

✅ Dual Thinking Modes - Switches between fast chat and deep reasoning automatically
✅ Reasoning Excellence - Outperforms many larger models on logic and math (MMLU-Redux: 55.6, MATH-500: 77.6)
✅ Ultra-Lightweight - 0.6B params (13x smaller than 8B models)
✅ Blazing Fast - Sub-millisecond inference, 10,000+ QPS capable
✅ Multilingual - Strong performance across many languages
✅ N150-Perfect - Guaranteed to work on DRAM-constrained systems
✅ 32K Context - Long conversations, document analysis
✅ Cost-Effective - Minimal compute requirements

Download Qwen3-0.6B:

hf download Qwen/Qwen3-0.6B --local-dir ~/models/Qwen3-0.6B

No HuggingFace token needed! Downloads in ~2-3 minutes.

⭐ Best Model for Coding Assistants: Qwen2.5-Coder-1.5B

Building AI coding assistants (Aider, Continue, etc.)? Use Qwen2.5-Coder - it's specialized for code generation:

🎯 Why Qwen2.5-Coder-1.5B is Perfect for Coding:

✅ Code-Specialized Training - Trained specifically on code datasets (Python, JavaScript, C++, etc.)
✅ Excellent Code Completion - Better code suggestions than general-purpose models
✅ Strong Code Understanding - Understands code structure, APIs, and patterns
✅ 1.5B params - Small enough for N150, large enough for quality results
✅ Fast Iteration - Quick responses for coding workflows
✅ N150-Perfect - Fits comfortably on single-chip hardware
✅ No Token Required - Open weights, freely available

Download Qwen2.5-Coder-1.5B-Instruct:

hf download Qwen/Qwen2.5-Coder-1.5B-Instruct --local-dir ~/models/Qwen2.5-Coder-1.5B-Instruct

Takes ~2-3 minutes to download. Perfect for:

AI coding assistants (Aider, Continue)
Code completion and generation
Code explanation and documentation
Bug fixing and refactoring
Learning programming with AI

Need even more code power? Try Qwen2.5-Coder-7B-Instruct (requires N300+):

hf download Qwen/Qwen2.5-Coder-7B-Instruct --local-dir ~/models/Qwen2.5-Coder-7B-Instruct

Need more power? Other options:

📥 Gemma 3-1B-IT - Slightly larger, Google quality

hf download google/gemma-3-1b-it --local-dir ~/models/gemma-3-1b-it

1B params (8x smaller than 8B)
140+ languages supported
32K context window
Good for N150, works on N300

📥 Llama-3.1-8B-Instruct - For N300/T3K/P100 only

hf download meta-llama/Llama-3.1-8B-Instruct --local-dir ~/models/Llama-3.1-8B-Instruct

Requirements:

HuggingFace token (gated model)
N300/T3K/P100 hardware (NOT recommended for N150)
Higher DRAM usage

Step 0: Update and Build TT-Metal (If Needed)

⚠️ Important: vLLM dev branch requires the latest tt-metal. If you get an InputRegistry error or "sfpi not found" error, update and rebuild tt-metal:

cd ~/tt-metal && \
  git checkout main && \
  git pull origin main && \
  git submodule update --init --recursive && \
  sudo ./install_dependencies.sh && \
  ./build_metal.sh

🔧 Update and Build TT-Metal

What this does:

Updates tt-metal to latest main branch
Updates all submodules (including SFPI libraries)
Installs/updates system dependencies (libraries, drivers, build tools)
Rebuilds tt-metal with latest changes
Takes ~5-15 minutes depending on hardware and system state

When to do this:

First time setting up vLLM
After updating tt-metal with git pull
If you see "sfpi not found" errors
If you see "InputRegistry" or other API compatibility errors
After system updates or fresh installations

Why install_dependencies.sh? tt-metal requires specific system libraries, kernel modules, and build tools. This script ensures all dependencies are installed before building. Skipping this step can cause build failures or runtime errors.

Why rebuild? tt-metal includes compiled components (like SFPI) that must be built after code updates. The build_metal.sh script handles all necessary compilation steps.

Verify vLLM Components

Before proceeding, let's check what you already have installed:

# Check if vLLM is cloned
[ -d ~/tt-vllm ] && echo "✓ vLLM repo found" || echo "✗ vLLM repo missing"

# Check if venv exists (correct location integrated with tt-metal)
[ -d ~/tt-metal/build/python_env_vllm ] && echo "✓ vLLM venv found" || echo "✗ vLLM venv missing"

# Check if activation script exists
[ -f ~/activate-vllm-env.sh ] && echo "✓ Activation script found" || echo "✗ Activation script missing"

# Check if server script exists
[ -f ~/tt-scratchpad/start-vllm-server.py ] && echo "✓ Server script found" || echo "✗ Server script missing"

All checks passed? You can skip to Step 4: Start the Server.

Some checks failed? Continue with Step 2 (environment setup) below.

Step 1: Clone TT vLLM Fork

First, get Tenstorrent's vLLM fork:

cd ~ && \
  git clone --branch dev https://github.com/tenstorrent/vllm.git tt-vllm && \
  cd tt-vllm

📦 Clone TT vLLM Repository

What this does:

Clones the dev branch (Tenstorrent's main branch)
Creates ~/tt-vllm directory
Takes ~1-2 minutes depending on connection

Step 2: Set Up vLLM Environment (Critical!)

⚠️ Important: vLLM requires a specific Python environment with exact dependency versions for Tenstorrent hardware compatibility. The most common issue is PyTorch version mismatches.

Automated Setup (Recommended) ⚡

The fastest and most reliable way:

bash ~/tt-scratchpad/setup-vllm-env.sh

What this script does:

✅ Validates prerequisites (tt-metal installed, paths correct)
✅ Creates Python venv at the CORRECT location (${TT_METAL_HOME}/build/python_env_vllm)
✅ Installs PyTorch 2.5.0+cpu (exact version required for TT hardware)
✅ Builds vLLM from source with TT hardware support
✅ Installs all required dependencies (ttnn, pytest, fairscale, etc.)
✅ Validates the installation (tests imports)
✅ Creates convenient activation script (~/activate-vllm-env.sh)

Time: ~5-10 minutes (downloads + compilation)

After completion:

source ~/activate-vllm-env.sh

Why This Matters

Common issue: vLLM on TT hardware requires:

PyTorch 2.5.0+cpu (not 2.7.1, not 2.4.x)
Environment integrated with tt-metal (not standalone venv)
Exact versions from requirements/tt.txt

Without the correct environment, you'll see:

TypeError: must be called with a dataclass type or instance
# ... torch/_inductor/runtime/hints.py errors

The automated script ensures everything is configured correctly!

Manual Setup (Alternative)

If you prefer to do it manually:

# 1. Set up environment variables
cd ~/tt-vllm
export vllm_dir=$(pwd)
source $vllm_dir/tt_metal/setup-metal.sh

# 2. Create Python venv at correct location
python3 -m venv $PYTHON_ENV_DIR
source $PYTHON_ENV_DIR/bin/activate

# 3. Install PyTorch 2.5.0+cpu (specific version!)
pip install --upgrade pip
pip install --index-url https://download.pytorch.org/whl/cpu \
    torch==2.5.0+cpu \
    torchvision==0.20.0 \
    torchaudio==2.5.0

# 4. Install dependencies
pip install --upgrade ttnn pytest
pip install fairscale termcolor loguru blobfile fire pytz llama-models==0.0.48

# 5. Install vLLM from source
cd $vllm_dir
pip install -e . --extra-index-url https://download.pytorch.org/whl/cpu

Validate installation:

python3 -c "import torch; print('✓ torch', torch.__version__)"
python3 -c "import vllm; print('✓ vllm import successful')"
python3 -c "import ttnn; print('✓ ttnn import successful')"

Understanding the Starter Script

The extension automatically creates ~/tt-scratchpad/start-vllm-server.py for you. This production-ready script makes vLLM incredibly easy to use!

✨ New in v0.0.101: Hardware Auto-Detection! ✨ New in v0.0.99: Smart Defaults!

Just specify the model - everything else is auto-configured:

# Minimal command (recommended):
python ~/tt-scratchpad/start-vllm-server.py --model ~/models/Qwen3-0.6B

# Script automatically detects and configures:
# Hardware Detection:
#   → Runs tt-smi -s to detect hardware type
#   → Sets MESH_DEVICE (N150/N300/T3K/P100/P150/GALAXY)
#   → Sets TT_METAL_ARCH_NAME=blackhole (for P100/P150)
#   → Sets TT_METAL_HOME=~/tt-metal (if not already set)
#
# Model Configuration:
#   → --served-model-name Qwen/Qwen3-0.6B
#   → --max-model-len 2048
#   → --max-num-seqs 16
#   → --block-size 64

Override any setting as needed:

# Override hardware detection:
export MESH_DEVICE=N300
python ~/tt-scratchpad/start-vllm-server.py --model ~/models/Qwen3-0.6B

# Override defaults:
python ~/tt-scratchpad/start-vllm-server.py \
  --model ~/models/Qwen3-0.6B \
  --max-model-len 8192

What the script does automatically:

Detects hardware - Runs tt-smi to identify N150/N300/T3K/P100/P150
Sets environment variables - MESH_DEVICE, TT_METAL_ARCH_NAME, TT_METAL_HOME
Registers TT-optimized models - TTLlamaForCausalLM for hardware acceleration
Sets HF_MODEL - Auto-detects org prefix (Qwen/, google/, meta-llama/)
Sets served-model-name - Clean API names (no directory paths)
Applies sensible defaults - Good for development, prevents OOM

Works with any Llama-compatible model:

✅ Qwen3-0.6B, Qwen3-8B, Qwen-2.5-7B-Coder
✅ Gemma 3-1B-IT, Gemma 3-4B-IT
✅ Llama-3.1-8B-Instruct, Llama-3.1-70B-Instruct
✅ Mistral-7B-Instruct, Mistral family
✅ Any Llama-compatible architecture

Key insight: Qwen, Gemma, and Mistral use Llama architecture internally, so they automatically benefit from the TT-optimized Llama implementation!

Want to see the script? Open ~/tt-scratchpad/start-vllm-server.py - it's well-documented and shows exactly how everything works.

Step 3: Create the vLLM Starter Script

Before starting the server, create the script that registers TT models with vLLM:

📝 Create vLLM Starter Script

What this does:

Creates ~/tt-scratchpad/start-vllm-server.py
Registers TT-optimized model implementations (TTLlamaForCausalLM)
Works with Llama, Gemma, Qwen, Mistral, and other Llama-compatible models
Opens the file so you can see how it works

Why you need this:

vLLM doesn't automatically know about Tenstorrent's custom model implementations
Without this script, vLLM will fail with: ValidationError: Cannot find model module 'TTLlamaForCausalLM'
This script must run before vLLM starts

Quick Start: Try It Now!

✨ New in v0.0.101: Ultra-simple one-command start with full hardware auto-detection!

source ~/activate-vllm-env.sh && \
  python ~/tt-scratchpad/start-vllm-server.py --model ~/models/Qwen3-0.6B

That's literally it! The activation script sets up the environment and the starter script auto-detects and configures:

✅ Hardware type (N150/N300/T3K/P100/P150) via tt-smi
✅ MESH_DEVICE environment variable
✅ TT_METAL_ARCH_NAME (blackhole for P100/P150)
✅ TT_METAL_HOME (defaults to ~/tt-metal)
✅ Served model name (Qwen/Qwen3-0.6B)
✅ Sensible defaults (2048 context, 16 seqs, 64 block size)

Model served as Qwen/Qwen3-0.6B with sensible defaults. Works on any hardware!

Want more control? Continue to Step 4 below for hardware-specific configurations with optimized settings.

Step 4: Start the OpenAI-Compatible Server

Now start vLLM with your chosen model and hardware configuration. These commands show all parameters explicitly for learning purposes, but remember - you can use the minimal command above and override only what you need!

✅ Start here: Qwen3-0.6B is the recommended model for N150 - tiny, fast, and smart!

Choose your hardware:

N150 (Wormhole - Single Chip) - Most common for development

✅ Recommended: Qwen3-0.6B - Tiny, fast, reasoning-capable!

Command (tested and working):

source ~/activate-vllm-env.sh && \
  python ~/tt-scratchpad/start-vllm-server.py \
    --model ~/models/Qwen3-0.6B \
    --served-model-name Qwen/Qwen3-0.6B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 2048 \
    --max-num-seqs 16 \
    --block-size 64

💡 What you get:

~16 concurrent users with 2K context each
Sub-second inference - perfect for development
Reasoning capabilities - dual thinking modes
Zero DRAM issues - guaranteed to work on N150
Clean model name: Qwen/Qwen3-0.6B (not /home/user/models/...)

Note: HF_MODEL is auto-detected! The script automatically sets HF_MODEL=Qwen/Qwen3-0.6B from your --model path.

Alternative: Gemma 3-1B-IT (slightly larger, 32K context)

source ~/activate-vllm-env.sh && \
  python ~/tt-scratchpad/start-vllm-server.py \
    --model ~/models/gemma-3-1b-it \
    --served-model-name google/gemma-3-1b-it \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 2048 \
    --max-num-seqs 12 \
    --block-size 64

⚠️ Not recommended for N150: Llama-3.1-8B

Llama-3.1-8B typically exhausts DRAM on N150. Use Qwen3-0.6B or Gemma 3-1B-IT instead for reliable operation.

If you must try Llama on N150:

source ~/activate-vllm-env.sh && \
  python ~/tt-scratchpad/start-vllm-server.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --served-model-name meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 2048 \
    --max-num-seqs 2 \
    --block-size 64

🚀 Start vLLM with Llama (N150 - Not Recommended)

Warning: Expect DRAM exhaustion errors. Qwen3-0.6B is 13x smaller and works reliably.

N300 (Wormhole - Dual Chip)

source ~/activate-vllm-env.sh && \
  python ~/tt-scratchpad/start-vllm-server.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --served-model-name meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 131072 \
    --max-num-seqs 32 \
    --block-size 64 \
    --tensor-parallel-size 2

🚀 Start vLLM Server (N300)

T3K (Wormhole - 8 Chips)

source ~/activate-vllm-env.sh && \
  python ~/tt-scratchpad/start-vllm-server.py \
    --model ~/models/Llama-3.1-70B-Instruct \
    --served-model-name meta-llama/Llama-3.1-70B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 131072 \
    --max-num-seqs 64 \
    --block-size 64 \
    --tensor-parallel-size 8

🚀 Start vLLM Server (T3K)

Note: This uses the 70B model. Make sure you've downloaded it first.

P100 / P300c (Blackhole - Single Chip)

QB2 / QuietBox users: P300c is architecturally identical to P100. Use MESH_DEVICE=P100 and TT_METAL_ARCH_NAME=blackhole for single-chip lessons. A QuietBox 2 with 4× P300c = 4 independent single-chip devices; for most lessons use device 0 only.

source ~/activate-vllm-env.sh && \
  python ~/tt-scratchpad/start-vllm-server.py \
    --model ~/models/Llama-3.1-8B-Instruct \
    --served-model-name meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --max-num-seqs 4 \
    --block-size 64

🚀 Start vLLM Server (P100)

⚠️ Remember: P100/P300c requires TT_METAL_ARCH_NAME=blackhole environment variable.

💡 Memory Tip: These settings use 8K context to avoid OOM errors. For longer context (16K), use --max-model-len 16384 --max-num-seqs 1.

First time setup? Create the starter script before using any of the commands above:

📝 Create vLLM Starter Script

This creates ~/tt-scratchpad/start-vllm-server.py which registers TT models with vLLM. The hardware-specific buttons above will create this automatically if it doesn't exist, but you can also create it manually with this button.

Why a Custom Starter Script?

The Problem: vLLM doesn't automatically know about Tenstorrent's custom model implementations (like TTLlamaForCausalLM). Without registration, vLLM will fail with:

ValidationError: Cannot find model module 'TTLlamaForCausalLM'

The Solution: A production-ready starter script that:

Registers TT models with vLLM's ModelRegistry API before the server starts
Self-contained - No dependency on fragile examples/ directory
Production-ready - Can be version controlled, deployed, and maintained

What the script does:

from vllm import ModelRegistry

# Register TT Llama implementation
ModelRegistry.register_model(
    "TTLlamaForCausalLM",
    "models.tt_transformers.tt.generator_vllm:LlamaForCausalLM"
)

# Then start vLLM server with all your flags

Why not use python -m vllm.entrypoints.openai.api_server directly?

❌ TT models not registered → ValidationError
❌ Falls back to slow HuggingFace Transformers (CPU)
❌ No way to register via CLI flags or environment variables

Why not import from examples/?

❌ examples/ is not production code (may change/move/break)
❌ Creates fragile dependency on repository structure
❌ Not suitable for deployment or version control

✅ Our approach: Self-contained, production-ready script with inline registration

The extension creates this script automatically when you use any of the "Start vLLM Server" buttons above, or you can create it manually with the "Create vLLM Starter Script" button. You can also view/customize it at ~/tt-scratchpad/start-vllm-server.py.

Understanding the Configuration

Environment variables (all hardware types need these):

TT_METAL_HOME=~/tt-metal - Points to tt-metal installation (required by setup-metal.sh)
MESH_DEVICE=<your-hardware> - Targets your specific hardware (N150, N300, T3K, P100)
TT_METAL_ARCH_NAME=<architecture> - Required for Blackhole (P100): Set to blackhole. Wormhole chips (N150/N300/T3K) auto-detect but P100 needs explicit specification.
PYTHONPATH=$TT_METAL_HOME - Required so Python can import TT model classes from tt-metal

vLLM flags (vary by hardware):

--model - Local model path (downloaded in Lesson 3)
--max-model-len - Context limit (64K for single-chip, 128K for multi-chip)
--max-num-seqs - Maximum concurrent sequences (higher on multi-chip)
--block-size - KV cache block size (typically 64)
--tensor-parallel-size - Number of chips to use (only for multi-chip)

What you'll see:

INFO: Loading model meta-llama/Llama-3.1-8B-Instruct
INFO: Initializing TT-Metal backend...
INFO: Model loaded successfully
INFO: Started server process
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Server is ready! Leave this terminal open.

DIY: Switch Models Manually

Want to try a different model? It's easy! Just change the --model path in the command.

Example: Switch from Llama to Qwen on N150:

# Stop the current server (Ctrl+C in the server terminal)

# Start with Qwen instead
source ~/activate-vllm-env.sh && \
  python ~/tt-scratchpad/start-vllm-server.py \
    --model ~/models/Qwen3-8B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --max-num-seqs 4 \
    --block-size 64

That's it! The same script automatically detects Qwen is Llama-compatible and uses the TT-optimized implementation. Same performance, different model.

Try comparing:

Ask Llama: "Write hello world in Python"
Stop server (Ctrl+C)
Switch to Qwen (command above)
Ask Qwen the same question
Notice Qwen might give more detailed code comments (it's optimized for coding!)

For other hardware: Just copy the Qwen command from the Hardware Configuration section above.

Step 5: Test with OpenAI SDK

Open a second terminal and test with the OpenAI Python SDK:

# Install OpenAI SDK if needed
# pip install openai

from openai import OpenAI

# Point to your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy-key"  # vLLM doesn't require auth by default
)

# Chat completion with Qwen3-0.6B
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    max_tokens=128
)

print(response.choices[0].message.content)

💬 Test with OpenAI SDK

Response:

Machine learning is a subset of artificial intelligence that involves
training algorithms to learn from data and make predictions or decisions...

Why this is powerful: Your code is identical to code that calls OpenAI's API. Just change the base_url!

Step 6: Test with curl

You can also use curl (same API as OpenAI):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "user", "content": "Explain neural networks"}
    ],
    "max_tokens": 128
  }'

🔧 Test with curl

Response:

{
  "id": "cmpl-xxx",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "Qwen/Qwen3-0.6B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Neural networks are computing systems inspired by..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 45,
    "total_tokens": 50
  }
}

OpenAI-Compatible Endpoints

vLLM implements the OpenAI API specification:

POST /v1/chat/completions

Chat-style completions (like ChatGPT):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is AI?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

POST /v1/completions

Text completions (continue a prompt):

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "Once upon a time",
    "max_tokens": 100
  }'

GET /v1/models

List available models:

curl http://localhost:8000/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen3-0.6B",
      "object": "model",
      "owned_by": "tenstorrent"
    }
  ]
}

Streaming Responses

vLLM supports streaming (tokens arrive as they're generated):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

stream = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True,  # Enable streaming
    max_tokens=200
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end='', flush=True)

Output appears word-by-word as it's generated!

Continuous Batching Demo

vLLM's killer feature: serve multiple users efficiently:

import asyncio
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

async def query(prompt_id, prompt):
    """Send a query"""
    print(f"[{prompt_id}] Sending request...")
    response = client.chat.completions.create(
        model="Qwen/Qwen3-0.6B",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50
    )
    print(f"[{prompt_id}] Got response: {response.choices[0].message.content[:50]}...")

async def main():
    """Send 5 requests simultaneously"""
    tasks = [
        query(1, "What is AI?"),
        query(2, "Explain Python"),
        query(3, "What is quantum computing?"),
        query(4, "Tell me about space"),
        query(5, "How do computers work?")
    ]
    await asyncio.gather(*tasks)

asyncio.run(main())

vLLM handles all 5 requests efficiently using continuous batching - much better than sequential processing!

Step 7: Showcase - Test Qwen3-0.6B's Reasoning

Qwen3-0.6B's secret weapon: Dual thinking modes! It automatically switches between fast chat and deep reasoning.

Let's test its reasoning capabilities with a classic logic puzzle:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Classic reasoning test
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{
        "role": "user",
        "content": "A farmer has 17 sheep. All but 9 die. How many sheep are left? Think step by step."
    }],
    max_tokens=256
)

print(response.choices[0].message.content)

Expected output:

Let me think through this carefully:

1. The farmer starts with 17 sheep
2. "All but 9 die" means that 9 sheep survive
3. The sheep that die = 17 - 9 = 8 sheep
4. Therefore, 9 sheep remain alive

Answer: 9 sheep are left.

Why this works: Qwen3-0.6B recognizes this requires reasoning and automatically engages its "thinking mode" - even though it's only 0.6B parameters!

Try more reasoning challenges:

# Math reasoning
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{
        "role": "user",
        "content": "If a train travels 60 miles in 45 minutes, what is its speed in miles per hour?"
    }],
    max_tokens=128
)

# Pattern recognition
response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{
        "role": "user",
        "content": "What comes next in this sequence: 2, 4, 8, 16, __?"
    }],
    max_tokens=64
)

What makes Qwen3-0.6B special:

🧠 Dual Thinking Modes - Automatically engages deep reasoning when needed
🎯 Reasoning Benchmarks - MMLU-Redux: 55.6, MATH-500: 77.6 (impressive for 0.6B!)
⚡ Still Fast - Thinking mode adds minimal latency
💰 Best Value - Sub-1B parameters with reasoning capabilities

This is why Qwen3-0.6B punches way above its weight class!

Advanced Configuration

Custom Parameters

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 2048 \           # Max sequence length
  --max-num-seqs 16 \              # Max concurrent sequences
  --disable-log-requests \          # Reduce logging
  --trust-remote-code              # Allow custom models

Environment Variables

# Control tensor parallelism
export MESH_DEVICE=T3K  # or N150, N300, etc.

# Set cache directory
export HF_HOME=~/hf_cache

# Enable debug logging
export VLLM_LOGGING_LEVEL=DEBUG

Deployment Patterns

Pattern 1: Single Server

Simple deployment for moderate load:

python -m vllm.entrypoints.openai.api_server \
  --model $HF_MODEL \
  --host 0.0.0.0 \
  --port 8000

Good for: Dev/test, small teams, moderate QPS

Pattern 2: Docker Container

Containerized deployment:

FROM tenstorrent/tt-metal:latest

RUN pip install vllm

CMD python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000

Good for: Consistent environments, easier scaling

Pattern 3: Load Balanced

Multiple vLLM servers behind nginx:

nginx (load balancer)
  ├── vLLM server 1 (port 8001)
  ├── vLLM server 2 (port 8002)
  └── vLLM server 3 (port 8003)

Good for: High availability, horizontal scaling

Performance Tuning

Tips for best performance:

Set appropriate batch size:

--max-num-seqs 32  # Higher = more throughput, more memory

Optimize sequence length:

--max-model-len 2048  # Match your use case

Enable GPU memory optimization:

--gpu-memory-utilization 0.9  # Use 90% of GPU memory

Monitor metrics:

Watch request latency
Track throughput (requests/sec)
Monitor GPU/NPU utilization

Monitoring and Observability

vLLM provides metrics endpoints:

# Prometheus metrics
curl http://localhost:8000/metrics

# Health check
curl http://localhost:8000/health

# Server stats
curl http://localhost:8000/v1/models

Integration with monitoring tools:

Prometheus for metrics collection
Grafana for visualization
Custom alerting on latency/throughput

Comparison: Your Journey

Approach	Speed	Control	Prod-Ready	Use Case
Lesson 3: One-shot	Slow	Low	❌	Testing
Lesson 4: Direct API	Fast	High	⚠️	Learning
Lesson 5: Flask	Fast	High	⚠️	Prototyping
Lesson 6: vLLM	Fast	Medium	✅	Production

Summary:

Lessons 3-4: Learn how inference works
Lesson 5: Build custom APIs
Lesson 6: Deploy at scale

Each approach serves a purpose - choose based on your needs.

Troubleshooting

Don't worry if you hit issues - they're usually straightforward to fix. Here are common solutions:

Server Won't Start

Check your environment:

# Activate environment
source ~/activate-vllm-env.sh

# Verify model path
ls ~/models/Llama-3.1-8B-Instruct/config.json

PyTorch dataclass errors (TypeError: must be called with a dataclass type or instance): This is the most common environment issue - wrong PyTorch version!

# Check your PyTorch version
source ~/activate-vllm-env.sh
python3 -c "import torch; print('PyTorch version:', torch.__version__)"

If you see anything other than 2.5.0+cpu, recreate your environment:

# Run the automated setup script
bash ~/tt-scratchpad/setup-vllm-env.sh

Import errors (e.g., "No module named 'llama_models'", "No module named 'fairscale'", "No module named 'pytz'", etc.): These usually mean the environment wasn't set up correctly. Best solution: recreate it.

# Run the automated setup script
bash ~/tt-scratchpad/setup-vllm-env.sh

Out of Memory / DRAM Exhausted (N150 Users): If larger models (8B params) exhaust your DRAM on N150, use smaller models:

Recommended small models:

Qwen3-0.6B - 0.6B params (13x smaller than 8B) ✅ Best for N150

# Download and run Qwen3-0.6B
hf download Qwen/Qwen3-0.6B --local-dir ~/models/Qwen3-0.6B

# Start server (use N150 command from Step 4 above)
python ~/tt-scratchpad/start-vllm-server.py --model ~/models/Qwen3-0.6B ...


- **Gemma 3-1B-IT** - 1B params (8x smaller than 8B)
  ```bash
  # Download and run Gemma 3-1B-IT
  hf download google/gemma-3-1b-it --local-dir ~/models/gemma-3-1b-it

  # Start server (use N150 command from Step 4 above)
  python ~/tt-scratchpad/start-vllm-server.py --model ~/models/gemma-3-1b-it ...

Why small models work better on N150:

Minimal DRAM usage - Fits comfortably in N150's memory
Faster inference - Smaller model = faster generation
Same API - Works with all the same commands
Perfect for development - Ideal for testing and iteration

AttributeError: 'InputRegistry' object has no attribute 'register_input_processor': Error: sfpi not found at /home/user/tt-metal/runtime/sfpi: These errors indicate tt-metal needs to be updated and rebuilt. Solution:

# Update and rebuild tt-metal (Step 0)
cd ~/tt-metal
./build_metal.sh --clean       # Clean old build artifacts first
git checkout main
git pull origin main
git submodule update --init --recursive
sudo ./install_dependencies.sh      # Install/update system dependencies
./build_metal.sh               # Build tt-metal

# Then recreate vLLM environment with updated ttnn
bash ~/tt-scratchpad/setup-vllm-env.sh

Why --clean? Removes all cached build artifacts to prevent conflicts between old and new versions. This forces a complete rebuild from scratch.

Why install_dependencies.sh? Ensures all system libraries, kernel modules, and build tools are installed before building. Prevents build failures and runtime errors.

Why rebuild? tt-metal includes compiled components (SFPI libraries, kernels) that must be built after code updates. The vLLM dev branch expects the latest tt-metal APIs.

RuntimeError: Failed to infer device type (Blackhole P100): The start-vllm-server.py script now auto-detects P100 and sets TT_METAL_ARCH_NAME=blackhole automatically!

If auto-detection fails, you can override:

export TT_METAL_ARCH_NAME=blackhole
export MESH_DEVICE=P100
source ~/activate-vllm-env.sh && \
  python ~/tt-scratchpad/start-vllm-server.py \
    --model ~/models/Llama-3.1-8B-Instruct

Why this happens: Blackhole hardware (P100) requires explicit architecture specification. The starter script detects this via tt-smi and sets it automatically. If detection fails, set manually as shown above.

ValidationError: Cannot find model module 'TTLlamaForCausalLM': This error means vLLM cannot find the TT model implementation. Solution:

# Use the starter script (Step 4) which registers TT models
python ~/tt-scratchpad/start-vllm-server.py --model ~/models/Llama-3.1-8B-Instruct

Why this happens: vLLM needs to explicitly register TT models using ModelRegistry.register_model() before starting. The starter script does this automatically. Do NOT call python -m vllm.entrypoints.openai.api_server directly - it will fail because TT models aren't registered.

Verify your starter script exists:

ls -la ~/tt-scratchpad/start-vllm-server.py
# If missing, use the extension button "Create vLLM Server Starter Script" in Lesson 6

Other import errors or virtual environment issues (e.g., "No module named 'xyz'"): Best solution: recreate the environment with the automated script.

# Recreate the vLLM environment (will prompt before removing existing)
bash ~/tt-scratchpad/setup-vllm-env.sh

This ensures:

✅ Correct PyTorch version (2.5.0+cpu)
✅ Correct environment location (integrated with tt-metal)
✅ All dependencies installed properly
✅ Environment validated before completion

Slow inference:

Check --max-num-seqs setting
Monitor GPU/NPU utilization
Reduce --max-model-len if not needed

Out of memory:

Reduce --max-num-seqs
Reduce --max-model-len
Close other programs

What You Learned

✅ How to install and configure vLLM for Tenstorrent
✅ OpenAI-compatible API usage
✅ Continuous batching for efficient serving
✅ Streaming responses
✅ Production deployment patterns
✅ Performance monitoring and tuning

Key takeaway: vLLM bridges the gap between custom code and production deployment, giving you enterprise features while maintaining compatibility with standard APIs.

Bonus Lap: AI Coding Agents - Build Something Right Now!

You just got vLLM running - let's immediately put it to work! 🚀

Now that your local model server is running, you can connect AI coding agents to build projects with AI assistance. This is 100% private (your code never leaves your machine), zero API costs, and surprisingly capable.

Why This Matters

100% Private - All AI runs locally on your Tenstorrent hardware
Zero Cost - No OpenAI/Anthropic API fees
Fast - Specialized hardware acceleration
Full Control - See exactly how the AI assists you
Educational - Learn by watching AI write code

Prerequisites

Before starting, make sure:

✅ vLLM server is running from the previous steps (test with curl http://localhost:8000/health)
✅ Model is loaded and responding
✅ You have Python 3.9+ and git installed

Option 1: Aider CLI Agent (Recommended)

Aider is a powerful CLI tool that edits your code files directly with full git integration.

Quick Setup (Automated) ⚡

The fastest way! Run our automated setup script:

bash ~/tt-scratchpad/setup-aider.sh

This script automatically:

✅ Creates Python virtual environment (~/aider-venv)
✅ Installs aider-chat
✅ Configures Aider for Qwen2.5-Coder
✅ Creates wrapper script (aider-tt)
✅ Tests connection to vLLM server

Takes ~2 minutes. After completion, just run aider-tt to start!

Manual Setup (Alternative)

Prefer to do it manually? Follow these steps:

# Create dedicated virtual environment for Aider
python3 -m venv ~/aider-venv
source ~/aider-venv/bin/activate

# Install Aider
pip install aider-chat

# Verify installation
aider --version

Configure Aider for Your Local Model

Create Aider's configuration file:

# Create config directory
mkdir -p ~/.aider

# Create config file
cat > ~/.aider/aider.conf.yml << 'EOF'
# Aider configuration for local vLLM server

# Use OpenAI-compatible API format with Qwen2.5-Coder (code-specialized model!)
model: openai/Qwen/Qwen2.5-Coder-1.5B-Instruct

# Point to your local vLLM server
openai-api-base: http://localhost:8000/v1

# No API key needed for local server
openai-api-key: sk-no-key-required

# Model settings optimized for Qwen2.5-Coder
max-tokens: 2048
temperature: 0.6

# Git settings
auto-commits: false
dirty-commits: true
EOF

echo "✓ Aider configuration created at ~/.aider/aider.conf.yml"

Why Qwen2.5-Coder? It's specifically trained for coding tasks and will give you much better results than general-purpose models for code generation, refactoring, and bug fixing!

Test Aider Connection

# Activate Aider environment
source ~/aider-venv/bin/activate

# Quick connection test (will exit immediately)
aider --model openai/Qwen/Qwen2.5-Coder-1.5B-Instruct \
      --openai-api-base http://localhost:8000/v1 \
      --openai-api-key sk-no-key-required \
      --yes \
      --message "/exit"

If you see the Aider prompt, you're connected! ✅

Your First AI-Assisted Project

Let's build a simple task manager to see Aider in action:

# Create project directory
mkdir -p ~/ai-projects/task-manager
cd ~/ai-projects/task-manager

# Initialize git (Aider loves git!)
git init
git config user.name "Your Name"
git config user.email "you@example.com"

# Create initial README
cat > README.md << 'EOF'
# Task Manager CLI

A command-line task manager built with AI assistance.
EOF

git add README.md
git commit -m "Initial commit"

# Start Aider with code-specialized model
aider --model openai/Qwen/Qwen2.5-Coder-1.5B-Instruct \
      --openai-api-base http://localhost:8000/v1 \
      --openai-api-key sk-no-key-required

Now you're in Aider! Try these prompts:

Aider> Create a task_manager.py file that implements a CLI task manager with add, list, and complete commands using argparse. Store tasks in a JSON file.

Aider> Add error handling for file operations

Aider> /diff
# Shows what changes were made

Aider> /run python task_manager.py add "Test task"
# Test your code!

Aider> /commit
# Commits changes with AI-generated commit message

Aider> /exit

Create a Convenient Wrapper Script (Optional)

Make Aider easier to launch:

# Create wrapper script
mkdir -p ~/bin
cat > ~/bin/aider-tt << 'EOF'
#!/bin/bash
# Aider wrapper for Tenstorrent local models

source ~/aider-venv/bin/activate

# Check if server is running
if ! curl -s http://localhost:8000/health > /dev/null 2>&1; then
    echo "ERROR: vLLM server is not running at http://localhost:8000"
    echo "Start the server first (see Lesson 6)."
    exit 1
fi

# Run Aider with local code-specialized model
exec aider \
    --model openai/Qwen/Qwen2.5-Coder-1.5B-Instruct \
    --openai-api-base http://localhost:8000/v1 \
    --openai-api-key sk-no-key-required \
    "$@"
EOF

chmod +x ~/bin/aider-tt

# Add to PATH
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Now you can just type: aider-tt

Useful Aider Commands

# Inside Aider prompt
/help                 # Show all commands
/add <file>          # Add file to chat context
/drop <file>         # Remove file from context
/diff                # Show pending changes
/undo                # Undo last change
/commit              # Commit with AI message
/run <command>       # Run shell command
/exit                # Exit Aider

# Starting Aider with specific files
aider file1.py file2.py    # Add files immediately to context

Option 2: Continue VSCode Extension

Continue brings AI assistance directly into VSCode. Great if you prefer IDE workflows.

Install Continue

Open VSCode
Go to Extensions (Ctrl+Shift+X / Cmd+Shift+X)
Search for "Continue"
Click "Install"

Configure Continue

Click the Continue icon in the sidebar
Click the gear icon (⚙️) to open settings
Replace the config with:

{
  "models": [
    {
      "title": "Qwen2.5-Coder 1.5B (Local TT - Code Specialist)",
      "provider": "openai",
      "model": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "sk-no-key-required"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Llama 3.2 3B (Local TT)",
    "provider": "openai",
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "sk-no-key-required"
  },
  "allowAnonymousTelemetry": false
}

Save (Ctrl+S / Cmd+S)
Reload window: Ctrl+Shift+P → "Developer: Reload Window"

Using Continue

Chat Interface:

Click Continue icon in sidebar
Select model from dropdown
Start chatting about your code

Inline Editing:

Highlight code in editor
Press Ctrl+I (Cmd+I on Mac)
Type instructions (e.g., "Add error handling")
Press Enter

Tab Autocomplete:

Just start typing
Continue suggests completions
Press Tab to accept

Example Workflow: Build a Weather CLI

Let's build a complete project using your local AI:

# Setup
mkdir -p ~/ai-projects/weather-cli
cd ~/ai-projects/weather-cli
git init

# Start Aider
source ~/aider-venv/bin/activate
aider --model openai/meta-llama/Llama-3.2-3B-Instruct \
      --openai-api-base http://localhost:8000/v1 \
      --openai-api-key sk-no-key-required

Step-by-step prompts in Aider:

1. Create a weather.py file that fetches weather data from wttr.in using the requests library.

2. Add a CLI interface using click that accepts a city name and displays temperature and conditions.

3. Add colored output using colorama to make it visually appealing.

4. Create a requirements.txt with all dependencies.

5. Add error handling for network failures and invalid cities.

6. Create a README.md with installation and usage instructions.

7. Create tests in test_weather.py using pytest.

Test your app:

# Install dependencies
pip install -r requirements.txt

# Run the app
python weather.py "San Francisco"
python weather.py "Tokyo"

Best Practices for AI-Assisted Coding

1. Start Small

# Good: Specific, focused request
"Add input validation to the login function"

# Too broad: Vague, hard to implement
"Make the app better"

2. Iterate Incrementally

# Step 1
"Create a basic user class with name and email fields"

# Step 2
"Add password hashing to the user class"

# Step 3
"Add validation for email format"

3. Provide Context

# Good: Provides context
"Add error handling to the API call in fetch_data(). Handle network timeouts, 404s, and JSON decode errors."

# Less effective: Lacks context
"Add error handling"

4. Use Git Effectively

# Commit frequently with Aider
Aider> /commit

# Review changes before committing
Aider> /diff

# Undo if needed
Aider> /undo

5. Test as You Go

# Test after each feature
Aider> /run pytest
Aider> /run python app.py --test-mode

Troubleshooting

Issue: "Connection refused" to local model

# Check if server is running
curl http://localhost:8000/health

# If not running, restart from Step 4 of this lesson
# Go back to the server terminal and verify it's running

Issue: Slow responses from model

# Reduce max_tokens in Aider config
# Edit ~/.aider/aider.conf.yml
max-tokens: 512  # Instead of 2048

# Use shorter, more focused prompts

Issue: Model gives poor suggestions

# Be more specific in your instructions
"Add error handling for FileNotFoundError and PermissionError when reading config.json"

# Provide examples
"Create a function similar to this: [paste example code]"

# Iterate with feedback
"The previous code had a bug where X. Fix it by doing Y."

Issue: Aider won't start

# Ensure virtual environment is activated
source ~/aider-venv/bin/activate

# Reinstall if needed
pip install --upgrade aider-chat

# Check Python version (must be 3.9+)
python --version

Example Projects to Try

Beginner: Todo List App

CLI with add/list/complete/delete commands
JSON file storage
Tests with pytest
~30 minutes with AI assistance

Intermediate: REST API

FastAPI server with CRUD endpoints
SQLite database
Request validation
Basic authentication
~60 minutes with AI assistance

Advanced: Data Analyzer

Read CSV files
Data analysis with pandas
Generate visualizations with matplotlib
Export reports
~90 minutes with AI assistance

Comparing Aider vs Continue

Feature	Aider (CLI)	Continue (VSCode)
Interface	Command line	VSCode integrated
Git Integration	Excellent (auto-commits)	Manual
Multi-file Editing	Native support	Context-based
Tab Completion	No	Yes
Inline Editing	No	Yes
Best For	Focused coding sessions	Continuous development

Recommendation:

Use Aider for: New projects, refactoring, focused feature work
Use Continue for: Daily development, quick edits, exploration

Next Steps

You've completed the walkthrough! 🎉

Where to go from here:

Build Applications:
- Integrate with your existing services
- Build chat interfaces
- Create AI-powered features
Optimize Performance:
- Tune batch sizes for your workload
- Implement caching strategies
- Monitor and optimize
Scale Up:
- Deploy multiple instances
- Add load balancing
- Implement autoscaling
Explore More Models:
- Try different Llama variants
- Test Mistral, Qwen, etc.
- Fine-tune for your use case

Learn More

TT vLLM Fork: github.com/tenstorrent/vllm
vLLM Docs: docs.vllm.ai
OpenAI API Reference: platform.openai.com/docs
TT-Metal Docs: docs.tenstorrent.com

Community & Support

GitHub Issues: Report bugs and request features
Discord: Join the Tenstorrent community
Documentation: Check the tt-metal README

Thank you for completing this walkthrough! You now have the knowledge to build, deploy, and scale AI applications on Tenstorrent hardware. 🚀