Supported hardware: N150, N300, T3K, P100, P150, P300C, Galaxy · Estimated time: 10 min · Validated

HTTP API Server with Direct API

Transform your interactive chat into a production-ready HTTP API using the Generator API - perfect for building applications that need fast, reliable inference.

From Chat to API

In Lesson 4, you built an interactive terminal chat. Now you'll wrap that same Generator API pattern in an HTTP server so any HTTP client can query the model.

The key advantage: the model stays loaded in memory between HTTP requests, so every response is fast.

What you'll build: A production-grade Flask server with the model loaded once on startup.

Architecture

graph TB
    Clients[HTTP Clients]

    subgraph Flask["Flask Server"]
        Generator["Generator API<br/>stays in memory"]
        Chat[POST /chat]
        Health[GET /health]

        Chat --> Generator
        Health -.-> Generator
    end

    Clients <--> Chat
    Clients <--> Health

    style Clients fill:#5347a4,stroke:#fff,color:#fff
    style Generator fill:#3293b2,stroke:#fff,color:#fff
    style Chat fill:#499c8d,stroke:#fff,color:#fff
    style Health fill:#499c8d,stroke:#fff,color:#fff

Performance: the model loads once at startup (2-5 minutes); after that, each request completes in roughly 1-3 seconds at around 35-40 tokens per second.


Starting Fresh?

This lesson builds on Lesson 4. If you're jumping here directly, verify your setup:

Quick Prerequisite Checks

# Hardware detected?
tt-smi -s

# tt-metal working?
python3 -c "import ttnn; print('✓ tt-metal ready')"

# Model downloaded (Meta format)?
ls ~/models/Llama-3.1-8B-Instruct/original/consolidated.00.pth

# Dependencies installed?
python3 -c "import pi; print('✓ pi installed')"
python3 -c "import flask; print('✓ flask installed')"

All checks passed? Continue to Step 1 below.

If any check fails:

No hardware or tt-metal? Complete the hardware setup and tt-metal installation from the earlier lessons before continuing.

No model? Download Llama-3.1-8B-Instruct in Meta format to ~/models/Llama-3.1-8B-Instruct (the same model used in Lesson 4).

No dependencies? Install the missing Python packages with pip; Flask is covered in Step 1 below.


Prerequisites

The prerequisites are the same as in Lesson 4: working Tenstorrent hardware, a tt-metal installation, and the Llama-3.1-8B-Instruct model downloaded locally.


Step 1: Install Flask (If Not Already Done)

Flask is a lightweight Python web framework:

pip install flask


What this does: installs Flask, the lightweight web framework that will serve the /chat and /health endpoints.
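If you want a quick sanity check that Flask itself works before wiring it to the model, a minimal sketch like this (the file name, route, and port are arbitrary choices) can be run and probed with curl:

# flask_smoke_test.py - a minimal sketch (file name, route, and port are
# illustrative) to confirm Flask works before involving the model.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/ping')
def ping():
    # Test with: curl http://127.0.0.1:5000/ping
    return jsonify({"status": "ok"})

if __name__ == '__main__':
    app.run(host='127.0.0.1', port=5000)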

Step 2: Create the API Server Script

This command creates ~/tt-scratchpad/tt-api-server-direct.py:

# Creates the API server with direct Generator API
mkdir -p ~/tt-scratchpad && cp template ~/tt-scratchpad/tt-api-server-direct.py && chmod +x ~/tt-scratchpad/tt-api-server-direct.py

What this does: creates the ~/tt-scratchpad directory, copies the server template into it as tt-api-server-direct.py, and marks it executable.

What's inside: a Flask app that loads the model once at startup and exposes the /chat and /health endpoints.
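As a rough orientation before you run it, the script's overall shape looks something like the hedged sketch below. The real excerpts appear under "Understanding the Code" later in this lesson; the bodies here are intentionally elided.

# Structural sketch only - names follow the excerpts shown later in this lesson.
from flask import Flask, request, jsonify

app = Flask(__name__)
GENERATOR = None  # the loaded model lives in module-level globals


def initialize_model():
    """Open the mesh device, build the model, and create the Generator (runs once)."""
    ...


def generate_response(prompt, max_tokens=128, temperature=0.0):
    """Prefill + decode with the already-loaded Generator (runs per request)."""
    ...


@app.route('/health', methods=['GET'])
def health():
    return jsonify({"status": "healthy", "model_loaded": GENERATOR is not None})


@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    response, tokens = generate_response(data['prompt'],
                                         data.get('max_tokens', 128),
                                         data.get('temperature', 0.0))
    return jsonify({"prompt": data['prompt'], "response": response,
                    "tokens_generated": tokens})


if __name__ == '__main__':
    initialize_model()  # load the model once, before serving requests
    app.run(host='127.0.0.1', port=8080)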

Step 3: Start the API Server

Now start the server (this takes 2-5 minutes to load the model):

cd ~/tt-metal && \
  export HF_MODEL=~/models/Llama-3.1-8B-Instruct && \
  export PYTHONPATH=$(pwd) && \
  python3 ~/tt-scratchpad/tt-api-server-direct.py --port 8080


What you'll see:

🔄 Importing tt-metal libraries and loading model...
   This will take 2-5 minutes on first run...

📥 Initializing Tenstorrent mesh device...
📥 Loading model into memory...
✅ Model loaded successfully!

🌐 Llama API Server (Direct API) on Tenstorrent
============================================================
Model: meta-llama/Llama-3.1-8B-Instruct

🚀 Server ready on http://127.0.0.1:8080

Available endpoints:
  • GET  http://127.0.0.1:8080/health
  • POST http://127.0.0.1:8080/chat

Note: Model is loaded in memory - inference is fast!
      No reloading between requests.

Press CTRL+C to stop the server

 * Running on http://127.0.0.1:8080

The model is now loaded and ready! Leave this terminal open.

Step 4: Test with curl

Open a second terminal and test the API.

Health Check

Verify the server is running and the model is loaded:

curl http://localhost:8080/health

Response:

{
  "status": "healthy",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "model_loaded": true,
  "note": "Model is loaded in memory for fast inference"
}
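If you script against the server, it can help to wait until /health reports the model as loaded before sending prompts. A minimal polling sketch in Python (the poll interval and timeout are arbitrary choices):

import time
import requests

def wait_for_server(url="http://localhost:8080/health", timeout=600):
    """Poll /health until the model is loaded or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            data = requests.get(url, timeout=5).json()
            if data.get("model_loaded"):
                print("Server is ready:", data.get("model"))
                return True
        except requests.RequestException:
            pass  # server not reachable yet (still loading the model)
        time.sleep(5)
    return False

if __name__ == "__main__":
    wait_for_server()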

Basic Inference Query

Send your first prompt (notice how fast it is!):

curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is machine learning?"}'


Response:

{
  "prompt": "What is machine learning?",
  "response": "Machine learning is a subset of artificial intelligence...",
  "tokens_generated": 45,
  "time_seconds": 1.23,
  "tokens_per_second": 36.6
}

Notice: Only 1-3 seconds! The model was already loaded.

Try Multiple Queries

Send several requests to see the speed:

# Question about AI
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain neural networks in simple terms"}'

# Creative writing
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a haiku about programming"}'

# Technical question
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What are transformers in AI?"}'


Each request takes ~1-3 seconds because the model stays loaded!
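To compare the numbers side by side, a small Python loop (the prompts are chosen arbitrarily) can print the timing fields that the server already returns with each response:

import requests

prompts = [
    "Explain neural networks in simple terms",
    "Write a haiku about programming",
    "What are transformers in AI?",
]

for prompt in prompts:
    r = requests.post("http://localhost:8080/chat", json={"prompt": prompt})
    data = r.json()
    # The server reports its own timing, so we just print what it returns.
    print(f"{prompt[:40]:40s}  {data['tokens_generated']:4d} tok  "
          f"{data['time_seconds']:.2f}s  {data['tokens_per_second']:.1f} tok/s")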

Custom Parameters

Control generation with optional parameters:

# Longer response
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Explain quantum computing",
    "max_tokens": 256
  }'

# More creative (higher temperature)
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a creative story",
    "max_tokens": 200,
    "temperature": 0.7
  }'

API Reference

POST /chat

Generates text using the loaded model.

Request Body (JSON):

{
  "prompt": "Your question here",
  "max_tokens": 128,        // Optional, default: 128
  "temperature": 0.0        // Optional, default: 0.0 (greedy)
}

Response (JSON):

{
  "prompt": "Your question here",
  "response": "Generated response...",
  "tokens_generated": 45,
  "time_seconds": 1.23,
  "tokens_per_second": 36.6
}

Parameters:

prompt (string, required): the text sent to the model
max_tokens (integer, optional, default 128): maximum number of tokens to generate
temperature (float, optional, default 0.0): 0.0 means greedy decoding; higher values produce more varied output

Error Response:

{
  "error": "Error message here"
}

GET /health

Health check endpoint.

Response (JSON):

{
  "status": "healthy",
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "model_loaded": true,
  "note": "Model is loaded in memory for fast inference"
}

Using from Python

Query the API from Python scripts:

import requests

response = requests.post(
    "http://localhost:8080/chat",
    json={
        "prompt": "What is machine learning?",
        "max_tokens": 128,
        "temperature": 0.0
    }
)

data = response.json()
print(f"Response: {data['response']}")
print(f"Speed: {data['tokens_per_second']:.1f} tokens/sec")

Using from JavaScript

Query from a web application:

fetch('http://localhost:8080/chat', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'What is machine learning?',
    max_tokens: 128
  })
})
.then(res => res.json())
.then(data => {
  console.log('Response:', data.response);
  console.log('Speed:', data.tokens_per_second, 'tok/s');
});

Understanding the Code

Open ~/tt-scratchpad/tt-api-server-direct.py in your editor. Key sections:

Initialization (Lines ~80-135)

def initialize_model():
    """Load model once at startup"""
    global GENERATOR, MODEL_ARGS, ...

    # Open mesh device
    MESH_DEVICE = ttnn.open_mesh_device(...)

    # Create model
    MODEL_ARGS, MODEL, TT_KV_CACHE, _ = create_tt_model(
        MESH_DEVICE,
        instruct=True,
        optimizations=DecodersPrecision.performance,
        ...
    )

    # Create generator
    GENERATOR = Generator([MODEL], [MODEL_ARGS], MESH_DEVICE, ...)

This runs once when the server starts!

Request Handling (Lines ~140-200)

def generate_response(prompt, max_tokens=128, temperature=0.0):
    """Use the loaded model for fast inference"""
    # Preprocess
    tokens, encoded, pos, lens = preprocess_inputs_prefill([prompt], ...)

    # Prefill and decode using the GLOBAL model
    logits = GENERATOR.prefill_forward_text(...)
    for _ in range(max_tokens):
        logits = GENERATOR.decode_forward_text(...)

    return response, tokens_generated

This runs for each HTTP request - fast!
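One detail the excerpt glosses over: Flask's development server can serve requests on multiple threads, while there is only one Generator in memory. If you expect concurrent clients, one option (not part of the provided script, so treat it as an assumption) is to serialize generation with a lock:

import threading

# Assumption: the single in-memory Generator should not run two generations at
# once, so concurrent /chat requests are serialized with a module-level lock.
GENERATION_LOCK = threading.Lock()

def generate_response_locked(prompt, max_tokens=128, temperature=0.0):
    # Wraps the script's generate_response(); only one request generates at a time.
    with GENERATION_LOCK:
        return generate_response(prompt, max_tokens, temperature)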

Flask Routes (Lines ~205-250)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json()
    prompt = data['prompt']
    max_tokens = data.get('max_tokens', 128)
    temperature = data.get('temperature', 0.0)

    response, tokens = generate_response(prompt, max_tokens, temperature)

    return jsonify({
        "prompt": prompt,
        "response": response,
        "tokens_generated": tokens,
        ...
    })

Performance Metrics

Watch the server logs to see real-time performance:

📝 Request: What is machine learning?...
✓ Generated 45 tokens in 1.23s (36.6 tok/s)

📝 Request: Explain neural networks...
✓ Generated 52 tokens in 1.41s (36.9 tok/s)

Typical performance: roughly 35-40 tokens per second, with most requests completing in 1-3 seconds once the model is loaded.

Customization Ideas

1. Add authentication

@app.route('/chat', methods=['POST'])
def chat():
    auth_token = request.headers.get('Authorization')
    if auth_token != 'Bearer your-secret-token':
        return jsonify({"error": "Unauthorized"}), 401
    ...

2. Rate limiting

from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

limiter = Limiter(get_remote_address, app=app, default_limits=["10 per minute"])

@app.route('/chat', methods=['POST'])
@limiter.limit("5 per minute")
def chat():
    ...

3. Streaming responses (a client-side sketch for consuming this appears after item 4)

from flask import Response, request, stream_with_context

@app.route('/chat/stream', methods=['POST'])
def chat_stream():
    prompt = request.get_json()['prompt']

    def generate():
        # generate_tokens() stands for a token-by-token variant of
        # generate_response() that you would need to write yourself
        for token in generate_tokens(prompt):
            yield f"data: {token}\n\n"

    return Response(stream_with_context(generate()),
                    mimetype='text/event-stream')

4. Request logging

import logging

logging.basicConfig(filename='api.log', level=logging.INFO)

@app.route('/chat', methods=['POST'])
def chat():
    prompt = request.get_json().get('prompt', '')
    logging.info(f"Request from {request.remote_addr}: {prompt[:50]}")
    ...
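If you add the streaming endpoint from idea 3, a client can consume the server-sent events with requests in streaming mode. The /chat/stream path below matches that sketch, so treat it as an assumption about your server:

import requests

# Reads server-sent events from the hypothetical /chat/stream endpoint (idea 3 above).
with requests.post(
    "http://localhost:8080/chat/stream",
    json={"prompt": "Write a haiku about programming"},
    stream=True,
) as r:
    for line in r.iter_lines(decode_unicode=True):
        if line and line.startswith("data: "):
            print(line[len("data: "):], end="", flush=True)
print()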

Deployment Considerations

For production:

  1. Use a production WSGI server (run it from ~/tt-scratchpad with the same HF_MODEL and PYTHONPATH environment as in Step 3):

pip install gunicorn
gunicorn -w 1 -b 0.0.0.0:8080 tt-api-server-direct:app

  2. Add HTTPS:

# Use nginx or Apache as reverse proxy with SSL

  3. Monitor and scale: watch the per-request token counts and timings in the server logs, and keep a single worker (-w 1) since the model lives in that worker's memory.

  4. Add proper error handling: validate the request body and return clear status codes (a sketch follows below).
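For item 4, here is a hedged sketch of what fuller error handling in the /chat route could look like; the 400/500 split and the field checks are common conventions, not something the provided script already does:

from flask import request, jsonify
# app and generate_response() come from the existing tt-api-server-direct.py script

@app.route('/chat', methods=['POST'])
def chat():
    data = request.get_json(silent=True) or {}
    prompt = data.get('prompt')
    if not prompt:
        # Client error: no prompt supplied
        return jsonify({"error": "Missing required field: prompt"}), 400
    try:
        response, tokens = generate_response(prompt,
                                             data.get('max_tokens', 128),
                                             data.get('temperature', 0.0))
    except Exception as exc:
        # Server error: generation itself failed
        return jsonify({"error": str(exc)}), 500
    return jsonify({"prompt": prompt, "response": response,
                    "tokens_generated": tokens})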

Stopping the Server

To stop the server:

  1. Switch to the server terminal
  2. Press Ctrl+C

The model will unload and cleanup happens automatically.

Troubleshooting

Port already in use:

python3 ~/tt-scratchpad/tt-api-server-direct.py --port 8081

Connection refused: the model may still be loading; wait for the "Server ready" line in the server terminal and make sure you are using the port the server printed.

Slow responses: check the tokens/sec reported in the server logs and keep max_tokens modest for quick tests; the first request after startup can be slower than the rest.

Out of memory: stop other processes that are using the Tenstorrent device, then restart the server so the model loads cleanly.

What You Learned

Key takeaway: Production AI APIs load the model once and handle many requests efficiently. This is the foundation for building scalable AI services.

What's Next?

You now have a production-style HTTP API: the model loads once at startup, stays in memory, and serves /chat and /health requests in a few seconds each.

Want to go even further? Lesson 6 introduces vLLM, a production-grade inference engine built for serving at scale.

Continue to Lesson 6: Production Inference with vLLM!

Learn More