Deploy Your Work to Koyeb
Deploy any Python application to Koyeb with Tenstorrent N300 hardware access. We'll use vLLM as the primary example, then show how to adapt the pattern for any application.
What You'll Learn
- Deploy vLLM to production with N300 hardware
- Containerize Python applications for Tenstorrent
- Configure hardware access and permissions
- Production deployment best practices
- Adapt the pattern for any application
Prerequisites
- Completed Deploy tt-vscode-toolkit to Koyeb (recommended)
- Koyeb CLI installed and authenticated
- Docker or Podman installed locally
- Completed vLLM Production lesson (for Part 1)
Part 1: Deploy vLLM to Koyeb
Step 1: Review Your vLLM Setup
From Lesson 7 (vLLM Production), you learned to run:
python -m vllm.entrypoints.openai.api_server \
--model ~/models/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--port 8000
Now we'll deploy this to Koyeb with N300 hardware.
Step 2: Create vLLM Dockerfile
Instead of building everything from scratch, extend our published image:
Dockerfile.vllm:
# Base on Tenstorrent's published image
FROM ghcr.io/tenstorrent/tt-vscode-toolkit:latest
# Switch to root to install additional packages (if needed)
USER root
# RUN apt-get update && apt-get install -y \
# your-dependencies-here \
# && rm -rf /var/lib/apt/lists/*
# Switch back to coder user
USER coder
WORKDIR /home/coder
# Install vLLM
RUN git clone https://github.com/tenstorrent/vllm.git && \
cd vllm && \
python3 -m venv vllm-env && \
. vllm-env/bin/activate && \
pip install -e .
# Download your model using pre-installed HuggingFace CLI
RUN mkdir -p models && \
hf download Qwen/Qwen3-0.6B --local-dir models/Qwen3-0.6B
# Environment variables for your app
ENV MODEL_PATH=/home/coder/models/Qwen3-0.6B
# Expose your app's port
EXPOSE 8000
# Run your app
CMD ["/bin/bash", "-c", "source vllm/vllm-env/bin/activate && python -m vllm.entrypoints.openai.api_server --model ${MODEL_PATH} --served-model-name Qwen/Qwen3-0.6B --port 8000 --host 0.0.0.0"]
Benefits:
- ✅ 50% fewer lines (was ~60, now ~30)
- ✅ No need to set up base system (Ubuntu, apt repos, users, permissions)
- ✅ HuggingFace CLI (`hf`) pre-installed
- ✅ Tenstorrent tools pre-installed (via tt-installer: `tt-smi`, `tt-flash`, etc.)
- ✅ All hardware permissions configured
- ✅ Just add your app!
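Optional: if you have a local machine with an N300, you can smoke-test the image before deploying. The device and hugepage paths below are typical for a tt-installer host but may differ on your system, so treat this as a sketch:

# Build the image locally
docker build -t vllm-n300 -f Dockerfile.vllm .

# Run it with the Tenstorrent device passed through (paths assume a standard tt-installer setup)
docker run --rm -it \
  --device /dev/tenstorrent \
  -v /dev/hugepages-1G:/dev/hugepages-1G \
  -p 8000:8000 \
  vllm-n300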
Step 3: Deploy to Koyeb
koyeb deploy . my-app/vllm \
--archive-builder docker \
--archive-docker-dockerfile Dockerfile.vllm \
--ports 8000:http \
--routes /:8000 \
--env MESH_DEVICE=N300 \
--regions na \
--instance-type gpu-tenstorrent-n300s \
--privileged
Deployment time: 10-15 minutes (builds vLLM + downloads model)
Cost optimization tip: Pre-build the Docker image locally and push to a registry to reduce deployment time:
# Build locally
docker build -t registry.koyeb.com/yourorg/vllm:latest -f Dockerfile.vllm .
# Push to registry
docker push registry.koyeb.com/yourorg/vllm:latest
# Deploy from registry (much faster!)
koyeb services create vllm \
--app my-app \
--docker registry.koyeb.com/yourorg/vllm:latest \
--ports 8000:http \
--routes /:8000 \
--env MESH_DEVICE=N300 \
--regions na \
--instance-type gpu-tenstorrent-n300s \
--privileged
Step 4: Test Your vLLM Deployment
Get your service URL:
koyeb services get vllm
Test with curl:
curl https://vllm-<your-hash>.koyeb.app/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"prompt": "Explain Tenstorrent hardware in one sentence:",
"max_tokens": 50
}'
Or use the OpenAI Python SDK:
from openai import OpenAI
client = OpenAI(
base_url="https://vllm-<your-hash>.koyeb.app/v1",
api_key="not-needed" # vLLM doesn't require auth by default
)
response = client.completions.create(
model="Qwen/Qwen3-0.6B",
prompt="What is Tenstorrent?",
max_tokens=50
)
print(response.choices[0].text)
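The server also exposes the OpenAI-compatible chat endpoint. For a chat-tuned model like Qwen3-0.6B you can call it the same way, shown here with curl:

curl https://vllm-<your-hash>.koyeb.app/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is Tenstorrent?"}],
    "max_tokens": 50
  }'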
Part 2: Deploy Any Python Application
General Pattern
The Dockerfile pattern works for any Python application:
- Base image: `ghcr.io/tenstorrent/tt-vscode-toolkit:latest` (includes tt-smi, permissions, CLIs)
- Install your app: Clone repo, install dependencies
- Set environment: Any additional environment variables your app needs
- Expose ports: Your app's port
- Deploy with: `--privileged` and `gpu-tenstorrent-n300s`
Benefits of using our base image:
- ✅ tt-smi pre-installed
- ✅ HuggingFace CLI (`hf`) and Claude CLI (`claude`) ready to use
- ✅ Hardware permissions configured (video, render groups)
- ✅ MOTD system for helpful terminal messages
- ✅ Faster builds (only your app layer)
- ✅ Clean base for your applications
Example: Custom Inference Server
Let's say you built a custom Flask API in Lesson 10:
Dockerfile.custom:
# Base on Tenstorrent's published image
FROM ghcr.io/tenstorrent/tt-vscode-toolkit:latest
USER coder
WORKDIR /home/coder
# Copy your application
COPY --chown=coder:coder . /home/coder/app
# Install your dependencies
RUN cd app && \
python3 -m venv venv && \
. venv/bin/activate && \
pip install -r requirements.txt
# Expose your port
EXPOSE 5000
# Run your app
CMD ["/bin/bash", "-c", "cd app && source venv/bin/activate && python server.py"]
Much simpler! From 80 lines to 15 lines.
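For reference, a minimal server.py compatible with this Dockerfile might look like the sketch below. This is hypothetical — run_inference() is a placeholder, and your Lesson 10 app will have its own routes and model code:

# Hypothetical minimal server.py; run_inference() stands in for your model call
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.get("/health")
def health():
    return jsonify({"status": "healthy"})

@app.post("/infer")
def infer():
    prompt = request.json.get("prompt", "")
    return jsonify({"output": run_inference(prompt)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)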
Deploy:
koyeb deploy . my-app/inference \
--archive-builder docker \
--archive-docker-dockerfile Dockerfile.custom \
--ports 5000:http \
--routes /:5000 \
--env MESH_DEVICE=N300 \
--regions na \
--instance-type gpu-tenstorrent-n300s \
--privileged
Example: Data Processing Pipeline
For batch processing (not a server):
Dockerfile.batch:
# Base on Tenstorrent's published image
FROM ghcr.io/tenstorrent/tt-vscode-toolkit:latest
USER coder
WORKDIR /home/coder
# Your processing script
COPY --chown=coder:coder process.py /home/coder/
# Install dependencies
RUN python3 -m venv venv && \
. venv/bin/activate && \
pip install torch ttnn numpy
# Run processing script
CMD ["/bin/bash", "-c", "source venv/bin/activate && python process.py"]
Even simpler! Just 13 lines total.
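As a reference point, process.py could be something like the sketch below: a batch job that opens the device, runs an operation, and exits. The ttnn calls are illustrative; your actual pipeline will differ:

# Illustrative batch job; replace the toy matmul with your real workload
import torch
import ttnn

device = ttnn.open_device(device_id=0)

a = ttnn.from_torch(torch.rand(32, 32), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch.rand(32, 32), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
result = ttnn.to_torch(ttnn.matmul(a, b))
print("Result shape:", result.shape)

ttnn.close_device(device)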
This runs once per deployment. For scheduled tasks, combine with Koyeb's job scheduling.
Part 3: Production Considerations
Scaling
Auto-scaling configuration:
koyeb services create vllm \
--app my-app \
--docker registry.koyeb.com/yourorg/vllm:latest \
--min-scale 1 \
--max-scale 3 \
--autoscaling-average-cpu 70 \
--ports 8000:http \
--instance-type gpu-tenstorrent-n300s \
--privileged
Multiple regions:
--regions na,fra # Deploy to US and Europe
Load balancing: Automatic across all instances
Monitoring
Check service health:
koyeb services get vllm
View logs:
koyeb services logs vllm -f
Metrics: Available in Koyeb dashboard
- Request rate
- Response time
- CPU/Memory usage
- Hardware utilization
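Hardware utilization can also be checked directly on an instance with tt-smi. The exec syntax below is a sketch and may vary with your Koyeb CLI version:

# List instances for the service, then open a shell on one
koyeb instances list --app my-app
koyeb instances exec <instance-id> /bin/bash

# Inside the instance
tt-smi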
Health Checks
Add health check endpoints to your application:
# In your Flask app (FastAPI is similar; return a JSONResponse with status_code=503 instead of a tuple)
from flask import Flask

app = Flask(__name__)

@app.get("/health")
def health():
    return {"status": "healthy"}

@app.get("/readiness")
def readiness():
    # model_loaded / hardware_ok are app-specific checks (model in memory, hardware reachable, etc.)
    if model_loaded and hardware_ok:
        return {"status": "ready"}
    return {"status": "not ready"}, 503
Configure in Koyeb:
koyeb services create vllm \
--app my-app \
--docker registry.koyeb.com/yourorg/vllm:latest \
--checks "8000:http:/health" \
--ports 8000:http \
--instance-type gpu-tenstorrent-n300s \
--privileged
Security
API Authentication:
Add authentication to your application:
import os

from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

@app.post("/v1/completions")
def completions(authorization: str = Header(None)):
    if authorization != f"Bearer {os.getenv('API_KEY')}":
        raise HTTPException(status_code=401, detail="Unauthorized")
    # ... your logic ...
Set API key via environment variable:
koyeb services update vllm \
--env API_KEY=your-secret-key
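Clients then pass the key in the Authorization header, for example:

curl https://vllm-<your-hash>.koyeb.app/v1/completions \
  -H "Authorization: Bearer your-secret-key" \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "prompt": "Hello", "max_tokens": 20}'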
Network isolation: Koyeb services are isolated by default. Use private networking for service-to-service communication.
Part 4: CI/CD Integration
GitHub Actions Example
.github/workflows/deploy-koyeb.yml:
name: Deploy to Koyeb

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Koyeb CLI
        run: curl -fsSL https://cli.koyeb.com/install.sh | sh

      - name: Build and push Docker image
        env:
          KOYEB_TOKEN: ${{ secrets.KOYEB_TOKEN }}
        run: |
          docker build -t registry.koyeb.com/${{ github.repository_owner }}/vllm:${{ github.sha }} -f Dockerfile.vllm .
          echo "$KOYEB_TOKEN" | docker login registry.koyeb.com -u ${{ github.repository_owner }} --password-stdin
          docker push registry.koyeb.com/${{ github.repository_owner }}/vllm:${{ github.sha }}

      - name: Deploy to Koyeb
        env:
          KOYEB_TOKEN: ${{ secrets.KOYEB_TOKEN }}
        run: |
          koyeb services update vllm \
            --docker-image registry.koyeb.com/${{ github.repository_owner }}/vllm:${{ github.sha }}
Setup:
- Get API token: https://app.koyeb.com/account/api
- Add it to GitHub Secrets as KOYEB_TOKEN
- Push to the main branch → automatic deployment
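If you use the GitHub CLI, the secret can be added from the terminal (the repository name is a placeholder):

# Prompts for the token value, or pipe it in with --body
gh secret set KOYEB_TOKEN --repo yourorg/yourrepo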
Troubleshooting
Hardware Not Accessible
Error: Permission denied: /dev/tenstorrent/0
Solution: Ensure --privileged flag is set:
koyeb services update vllm --privileged
And, if you built a custom image rather than extending the base image, verify the user is in the correct groups in your Dockerfile:
RUN usermod -aG video appuser && \
groupadd -f render && \
usermod -aG render appuser
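To confirm hardware access from a shell inside the running container:

ls -l /dev/tenstorrent/   # device nodes should be present
id                        # user should include the video (and render) groups
tt-smi                    # should detect the N300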
Build Timeouts
Error: Build exceeds time limit
Solutions:
- Pre-build images: Build locally, push to registry
- Reduce dependencies: Only install what you need
- Use build cache: Koyeb caches Docker layers
- Split into stages: Multi-stage Docker builds
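A sketch of the multi-stage idea, assuming a requirements.txt and server.py as in the earlier examples: heavy dependency installation happens in a builder stage, and only the finished virtualenv is copied into the final image:

# Stage 1: install dependencies once
FROM ghcr.io/tenstorrent/tt-vscode-toolkit:latest AS builder
USER coder
WORKDIR /home/coder
COPY --chown=coder:coder requirements.txt .
RUN python3 -m venv venv && . venv/bin/activate && pip install -r requirements.txt

# Stage 2: copy only the built environment plus your code
FROM ghcr.io/tenstorrent/tt-vscode-toolkit:latest
USER coder
WORKDIR /home/coder
COPY --from=builder /home/coder/venv /home/coder/venv
COPY --chown=coder:coder . /home/coder/app
CMD ["/bin/bash", "-c", "cd app && source ../venv/bin/activate && python server.py"]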
Model Download Fails
Error: HuggingFace download timeout
Solutions:
- Pre-download in image: Include model in Docker image
- Use registry: Push image with model pre-downloaded
- Increase timeout: Use `--health-checks-grace-period 300`
Cost Optimization
Tips:
- Use smaller instance types for testing (`small` instead of `gpu-tenstorrent-n300s`)
- Delete services when not in use (see the commands below)
- Use registry-based deployment to avoid rebuilds
- Set up auto-scaling to scale down during low usage
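For example, to stop paying for an idle service (pause/resume availability may depend on your plan and CLI version):

# Pause while idle, resume later
koyeb services pause vllm
koyeb services resume vllm

# Or remove it entirely
koyeb services delete vllm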
Example cost-effective setup:
# Development (no hardware)
koyeb services create vllm-dev \
--docker registry.koyeb.com/yourorg/vllm:latest \
--instance-type small
# Production (with hardware, auto-scales)
koyeb services create vllm-prod \
--docker registry.koyeb.com/yourorg/vllm:latest \
--instance-type gpu-tenstorrent-n300s \
--min-scale 1 \
--max-scale 3 \
--autoscaling-average-cpu 70 \
--privileged
Summary
What you learned:
- ✅ Deploy vLLM to production with N300 hardware
- ✅ Containerize any Python app for Tenstorrent by extending our base image
- ✅ Simplify Dockerfiles from 80 lines to 15 lines
- ✅ Set up monitoring and health checks
- ✅ Integrate with CI/CD pipelines
- ✅ Optimize for production and cost
Key pattern:
- Base image: `ghcr.io/tenstorrent/tt-vscode-toolkit:latest`
- Add your application layer (clone, install, configure)
- Deploy with `--privileged` and `gpu-tenstorrent-n300s`
Why use the base image:
- tt-smi, HuggingFace CLI, Claude CLI pre-installed
- Hardware permissions pre-configured
- MOTD system for better UX
- Much simpler Dockerfiles (15 lines vs 80 lines)
- Faster builds (only your app layer)
Next Steps
✅ You can now deploy any application with Tenstorrent hardware!
Continue your journey:
- 🎯 Interactive Chat - Integrate with VSCode Chat
- 🖼️ Image Generation - Deploy image generation services
- 🧠 CS Fundamentals - Deep dive into hardware
Share your deployment:
- Production APIs running on Tenstorrent hardware
- Scalable inference services
- Custom AI applications
Your applications now have access to cutting-edge AI acceleration! 🚀