N150 · N300 · T3K · P100 · P150 · P300C · Galaxy | 10 min | Validated

Interactive Chat with Direct API

Build your own interactive chat application using tt-metal's Generator API directly.

⚠️ Llama + tt-metal source required. The Generator API used in this lesson is Llama-specific and requires ~/tt-metal cloned and built from source. If you:

  • Haven't accepted Meta's data terms for Llama access, or
  • Don't have ~/tt-metal built (QB2 and pre-configured images don't ship it)

→ Use the vLLM path with Qwen3-0.6B instead — no source build, no license gate, works on all hardware.

Why Use the Direct API?

The Generator API is the foundation for building real AI applications. Instead of running inference once and exiting, you'll keep the model in memory and chat with it interactively - the same pattern used by ChatGPT and other conversational AI systems.

How It Works

The Generator API pattern:

sequenceDiagram
    participant User
    participant Generator
    participant Model
    participant Hardware

    Note over User,Hardware: Setup - 2-5 min, once
    User->>Generator: create_tt_model()
    Generator->>Model: Load weights
    Model->>Hardware: Allocate DRAM

    Note over User,Hardware: Chat Loop - 1-3 sec each
    loop Each Query
        User->>Generator: Input prompt
        Generator->>Model: Prefill forward
        Model->>Hardware: Process prompt
        Hardware-->>Model: Logits

        loop Token Generation
            Generator->>Model: Decode forward
            Model->>Hardware: Next token
            Hardware-->>Generator: Token
        end

        Generator-->>User: Response
    end

Code pattern:

# 1. Load model once (slow - 2-5 minutes)
from models.tt_transformers.tt.generator import Generator
from models.tt_transformers.tt.common import create_tt_model

model_args, model, tt_kv_cache, _ = create_tt_model(mesh_device, ...)
generator = Generator([model], [model_args], mesh_device, ...)

# 2. Chat loop - reuse the loaded model! (fast - 1-3 seconds per response)
while True:
    prompt = input("> ")

    # Preprocess
    tokens, encoded, pos, lens = preprocess_inputs_prefill([prompt], ...)

    # Prefill (process the prompt)
    logits = generator.prefill_forward_text(tokens, ...)

    # Decode (generate response token by token)
    for _ in range(max_tokens):
        logits = generator.decode_forward_text(...)
        next_token = sample(logits)
        if is_end_token(next_token):
            break

    response = tokenizer.decode(all_tokens)
    print(response)

Key insight: The model stays in memory between queries!
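You can see why this matters with a toy timing sketch (pure Python, no tt-metal needed, and `ToyModel` is a made-up stand-in): the expensive "load" happens once, and every "query" reuses the loaded state.

```python
import time

class ToyModel:
    """Stand-in for an expensive-to-load model (hypothetical, not tt-metal)."""
    def __init__(self):
        time.sleep(0.5)           # simulate slow weight loading
        self.weights = "loaded"

    def generate(self, prompt):
        return f"echo: {prompt}"  # simulate a fast forward pass

start = time.monotonic()
model = ToyModel()                # paid once, like create_tt_model()
load_time = time.monotonic() - start

start = time.monotonic()
for q in ["hi", "bye", "ok"]:
    model.generate(q)             # reused, like the chat loop
query_time = time.monotonic() - start

print(load_time > query_time)     # loading dominates; queries are nearly free
```

The same amortization is what makes the real chat loop feel instant after the first load.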


Starting Fresh?

If you're jumping directly to this lesson, verify your setup:

Quick Prerequisite Checks

# Hardware detected?
tt-smi -s

# tt-metal installed?
python3 -c "import ttnn; print('✓ tt-metal ready')"

# Model downloaded (Meta format)?
ls ~/models/Llama-3.1-8B-Instruct/original/consolidated.00.pth

All checks passed? Continue to Step 1 below.

If any checks fail, complete these lessons first:

  • No hardware detected → Hardware Detection
  • No tt-metal installed → Verify Installation or installation guide
  • No model downloaded → Download Model

Quick model download:

hf auth login --token "$HF_TOKEN"
hf download meta-llama/Llama-3.1-8B-Instruct --local-dir ~/models/Llama-3.1-8B-Instruct

Dependencies Required

This lesson uses the Generator API which needs:

pip install pi  # Required for Generator API
pip install git+https://github.com/tenstorrent/llama-models.git@tt_metal_tag

Already installed? Check with:

python3 -c "import pi; print('✓ pi installed')"

Not installed? Run the commands above or use the button in Step 1.


Prerequisites

This lesson requires the same setup as Lesson 3: working hardware, a source build of tt-metal, and the downloaded Llama model (see the Quick Prerequisite Checks above).


Step 1: Install Dependencies (If Not Already Done)

The Direct API needs specific Python packages:

pip install pi && pip install git+https://github.com/tenstorrent/llama-models.git@tt_metal_tag

🔧 Install Direct API Dependencies
pip install pi && pip install git+https://github.com/tenstorrent/llama-models.git@tt_metal_tag

What this installs:

  • pi - the package required by the Generator API
  • Meta's llama-models reference code, pinned to the tt_metal_tag release

Already installed? The command will skip packages that are already present.


Step 2: Create the Direct API Chat Script

This command creates ~/tt-scratchpad/tt-chat-direct.py - a standalone chat client using the Generator API:

# Creates the direct API chat script
mkdir -p ~/tt-scratchpad && cp template ~/tt-scratchpad/tt-chat-direct.py && chmod +x ~/tt-scratchpad/tt-chat-direct.py

📝 Create Direct API Chat Script

What this does: creates ~/tt-scratchpad/tt-chat-direct.py and makes it executable.

What's inside: a standalone chat client built on the Generator API - a one-time model-loading step plus a per-query inference loop, both walked through in Understanding the Code below.


Step 3: Start Interactive Chat

Now launch the chat session:

cd ~/tt-metal && \
  export HF_MODEL=~/models/Llama-3.1-8B-Instruct && \
  export PYTHONPATH=$(pwd) && \
  python3 ~/tt-scratchpad/tt-chat-direct.py

💬 Start Direct API Chat

What you'll see:

🔄 Importing tt-metal libraries (this may take a moment)...
📥 Loading model (this will take 2-5 minutes on first run)...
✅ Model loaded and ready!

🤖 Direct API Chat with Llama on Tenstorrent
============================================================
This version loads the model once and keeps it in memory.
After initial load, responses will be much faster!

Commands:
  • Type your prompt and press ENTER
  • Type 'exit' or 'quit' to end
  • Press Ctrl+C to interrupt

>

First run: 2-5 minutes to load (kernel compilation + model loading)
Subsequent queries: 1-3 seconds per response!

Step 4: Chat with Your Model

Try asking questions:

> What is machine learning?

🤖 Generating response...

Machine learning is a subset of artificial intelligence (AI) that
involves training algorithms to learn from data and make predictions
or decisions without being explicitly programmed...

------------------------------------------------------------

> Explain transformers in simple terms

🤖 Generating response...

Transformers are a type of neural network architecture that's really
good at understanding relationships in sequential data like text...

------------------------------------------------------------

> exit

👋 Chat session ended

Notice: only the first query pays the load cost. Every later response comes back in seconds because the model never leaves device memory between queries.

Understanding the Code

Open ~/tt-scratchpad/tt-chat-direct.py in your editor (it was opened automatically when you created it). Key sections:

Model Loading (Lines ~80-120)

def prepare_generator(mesh_device, max_batch_size=1, ...):
    # Create the model with optimizations
    model_args, model, tt_kv_cache, _ = create_tt_model(
        mesh_device,
        instruct=True,
        max_batch_size=max_batch_size,
        optimizations=DecodersPrecision.performance,
        paged_attention_config=PagedAttentionConfig(...),
    )

    # Create the generator
    generator = Generator([model], [model_args], mesh_device, ...)

    return generator, model_args, model, ...

This happens once at startup!

Inference (Lines ~125-180)

def generate_response(generator, prompt, max_tokens=128):
    # 1. Tokenize and preprocess
    tokens, encoded, pos, lens = preprocess_inputs_prefill([prompt], ...)

    # 2. Prefill - process the prompt
    logits = generator.prefill_forward_text(tokens, ...)

    # 3. Decode - generate tokens one by one
    for iteration in range(max_tokens):
        logits = generator.decode_forward_text(out_tok, current_pos, ...)
        next_token = sample(logits)
        if next_token == end_token:
            break

    # 4. Decode tokens to text
    response = tokenizer.decode(all_tokens)
    return response

This runs for each query - fast because model is already loaded!
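The `sample()` helper isn't shown in the excerpts above. Here is a minimal stdlib-only sketch of what such a helper might do - greedy argmax at temperature 0, softmax sampling otherwise; the actual implementation in the script may differ:

```python
import math
import random

def sample(logits, temperature=0.7, rng=random):
    """Pick a token index from raw logits.

    temperature == 0 -> greedy argmax (deterministic);
    higher temperature flattens the distribution (more varied output).
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total                   # walk the cumulative distribution
        if r <= acc:
            return i
    return len(logits) - 1

print(sample([1.0, 5.0, 2.0], temperature=0.0))  # → 1 (argmax)
```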

Customization Ideas

Now that you have the code, try modifying it:

1. Change temperature (creativity)

# When calling generate_response():
response = generate_response(..., temperature=0.7)  # More creative
# vs
response = generate_response(..., temperature=0.0)  # Deterministic

2. Increase max tokens

response = generate_response(..., max_generated_tokens=256)

3. Add streaming output

# Print tokens as they're generated
for iteration in range(max_tokens):
    logits = generator.decode_forward_text(...)
    next_token = sample(logits)
    print(tokenizer.decode([next_token]), end='', flush=True)

4. Multi-turn conversations

# Keep conversation history
conversation_history = []
while True:
    prompt = input("> ")
    conversation_history.append(f"User: {prompt}")
    full_prompt = "\n".join(conversation_history)
    response = generate_response(generator, full_prompt, ...)
    conversation_history.append(f"Assistant: {response}")
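One caveat with the history approach: the prompt grows with every turn and will eventually exceed the model's context window. A simple fix is to keep only the most recent turns - the turn budget below is an arbitrary choice, not a tt-metal limit:

```python
MAX_TURNS = 8  # keep at most 8 user/assistant exchanges (arbitrary budget)

def trim_history(history, max_turns=MAX_TURNS):
    """Drop the oldest entries so at most max_turns exchanges remain.

    Each exchange contributes two entries ("User: ..." and "Assistant: ...").
    """
    return history[-2 * max_turns:]

history = [f"User: q{i}" if i % 2 == 0 else f"Assistant: a{i}" for i in range(40)]
history = trim_history(history)
print(len(history))  # → 16
```

Call `trim_history()` once per loop iteration, right before joining the history into the full prompt.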

Performance Notes

What makes it fast:

  • The model stays loaded in device DRAM between queries
  • Kernels are compiled once on the first run, then reused
  • Each query only runs prefill and decode - no setup cost is repeated

Troubleshooting

Import errors:

export PYTHONPATH=~/tt-metal

MESH_DEVICE errors:

# Let tt-metal auto-detect (default behavior)
# Or explicitly set:
export MESH_DEVICE=N150  # or N300, T3K, etc.

Out of memory: try lowering max_batch_size in prepare_generator() or reducing the max tokens generated per query.

Slow first query: expected - the first run includes kernel compilation and model loading (2-5 minutes). Every query after that takes 1-3 seconds.

What You Learned

Key takeaway: Real AI applications load the model once and reuse it. This is the foundation for everything from chat apps to API servers.

What's Next?

Now that you can chat interactively, the next step is to wrap this pattern in an HTTP API so other applications can connect to your model.

Continue to Lesson 5: HTTP API Server!

Learn More