Understanding Custom Training
Welcome to the Custom Training series! This lesson provides a conceptual foundation for understanding how to build and customize AI models on Tenstorrent hardware.
What You'll Learn
- What is custom training and when do you need it?
- The difference between fine-tuning and training from scratch
- How training frameworks work together
- The tt-blacksmith approach to model development
- When to use tt-train vs tt-blacksmith vs PyTorch
Time: 10-15 minutes | Prerequisites: Basic understanding of machine learning concepts
Custom Training vs Inference
So far in this extension, you've learned how to run pre-trained models (inference). Now you'll learn how to create your own models (training).
Inference (What You've Done)
- Load a pre-trained model
- Feed it inputs, get outputs
- Like using a tool someone else built
- Fast, predictable, production-ready
Training (What We'll Build)
- Teach a model new behaviors
- Adjust billions of parameters
- Like building your own custom tool
- Slower, requires experimentation, incredibly powerful
Key insight: Training is where the magic happens. A model is just a collection of numbers (weights) until training teaches it what those numbers should be.
Two Paths to Custom Models
Path 1: Fine-Tuning (Lessons CT-2 through CT-6)
Start with a pre-trained model, teach it something new.
When to use:
- You want to specialize an existing model
- You have a specific task or domain
- You have 100-10,000 examples
- You want results in hours, not days
Example: Take TinyLlama (general language model) and fine-tune it to explain machine learning concepts in creative ways.
Analogy: Like hiring an experienced developer and training them on your company's codebase.
Path 2: Training from Scratch (Lessons CT-7 and CT-8)
Build a model from the ground up.
When to use:
- You want complete architectural control
- You're researching new model designs
- You want to deeply understand how models work
- You have time and computational resources
Example: Build a tiny transformer (10-20M parameters) that learns language patterns from scratch.
Analogy: Like teaching yourself programming from first principles.
The Training Framework Ecosystem
Tenstorrent's training ecosystem is designed around clarity and modularity. Here's how the pieces fit together:
tt-metal (Foundation)
- What it is: Core SDK for Tenstorrent hardware
- What it does: Low-level operations, kernels, device management, memory handling
- Why it matters: This is the foundation everything else builds on
- Location:
vendor/tt-metal/
tt-train (Training Framework)
- What it is: Python API for training on TT hardware
- What it does: PyTorch-like interface, built-in DDP for multi-device training, YAML configuration
- Why it matters: Makes training feel familiar to ML engineers while optimizing for TT hardware
- Location:
vendor/tt-metal/tt-train/
tt-blacksmith (Development Patterns)
- What it is: Not just for bounties - it's a development framework
- What it does: Config-driven patterns, modular organization, experiment management best practices
- Why it matters: Shows you how experienced engineers structure training projects
- Location: External reference (we'll apply these patterns throughout)
How they work together:
graph TD
A[Your Training Script] --> B[tt-train API<br/>High-level Training Interface]
B --> C[tt-metal SDK<br/>Hardware Operations]
C --> D[Tenstorrent Hardware<br/>N150/N300/T3K/P100/Galaxy]
E[tt-blacksmith Patterns] -.->|Best Practices<br/>Config Organization| A
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style B fill:#7B68EE,stroke:#333,stroke-width:2px
style C fill:#7B68EE,stroke:#333,stroke-width:2px
style D fill:#50C878,stroke:#333,stroke-width:2px
style E fill:#6C757D,stroke:#333,stroke-width:2px,stroke-dasharray: 5 5
Think of it like web development:
- tt-metal = Browser APIs (low-level)
- tt-train = React/Vue (framework)
- tt-blacksmith = Design patterns & best practices
- Your script = Your application
The tt-blacksmith Philosophy
tt-blacksmith isn't just a collection of bounty scripts - it's a framework for making things work on Tenstorrent hardware. Here are its key patterns:
1. Configuration-Driven Everything
Instead of hardcoding values, use YAML configs:
training_config:
  batch_size: 8
  learning_rate: 1e-4
  num_epochs: 3

device_config:
  enable_ddp: false  # Single device
  mesh_shape: [1, 1]

logging_config:
  use_wandb: false  # Optional experiment tracking
  log_level: "INFO"
Why: Easy to experiment, reproduce, and share configurations.
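For instance, here is a minimal sketch of the config-driven pattern in Python, assuming the YAML above is saved as a hypothetical train_config.yaml and PyYAML is installed:

import yaml

# Load every tunable value from the YAML file instead of hardcoding it
with open("train_config.yaml") as f:
    config = yaml.safe_load(f)

batch_size = config["training_config"]["batch_size"]
# PyYAML may load "1e-4" as a string (no decimal point), so cast explicitly
learning_rate = float(config["training_config"]["learning_rate"])
enable_ddp = config["device_config"]["enable_ddp"]

print(f"batch_size={batch_size}, lr={learning_rate}, ddp={enable_ddp}")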
2. Modular Organization
Separate concerns into focused components:
- Dataset handling - Load, validate, format data
- Model creation - Architecture definition
- Training loop - Forward, backward, optimize
- Evaluation - Generate samples, compute metrics
Why: Easier to debug, test, and reuse code.
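To make the separation concrete, here is a hypothetical project skeleton; the function names are illustrative, not part of tt-train or tt-blacksmith:

def load_dataset(path):
    """Dataset handling: load, validate, and format JSONL examples."""
    ...

def build_model(config):
    """Model creation: define the architecture from config values."""
    ...

def train(model, dataset, config):
    """Training loop: forward pass, loss, backward pass, optimizer step."""
    ...

def evaluate(model, prompts):
    """Evaluation: generate samples and compute metrics."""
    ...

Each piece can be tested and swapped independently, which is exactly what makes debugging tractable.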
3. Progressive Enhancement
Start simple, add complexity when needed:
- File-based logging → WandB integration
- Single device → Multi-device DDP
- Fine-tuning → Training from scratch
Why: Learn incrementally, avoid over-engineering.
Understanding the Training Process
Training a model is like teaching through repetition - show examples, measure mistakes, make corrections, repeat. Here's the complete flow:
graph TD
A[Raw Data<br/>Text files, datasets] --> B[Prepare Data<br/>JSONL format]
B --> C[Initialize Model<br/>Pre-trained OR random weights]
C --> D{Training Loop<br/>Multiple epochs}
D --> E[Get Batch<br/>8-32 examples]
E --> F[Forward Pass<br/>Model makes predictions]
F --> G[Compute Loss<br/>How wrong?]
G --> H[Backward Pass<br/>Calculate gradients]
H --> I[Update Weights<br/>Optimizer step]
I --> J{More Batches?}
J -->|Yes| E
J -->|No| K[Evaluation<br/>Generate samples, check quality]
K --> L[Save Checkpoint<br/>Model weights + optimizer state]
L --> M{Continue Training?}
M -->|Yes, more epochs| D
M -->|No, training complete| N[Deployment<br/>Use with vLLM for inference]
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style B fill:#7B68EE,stroke:#333,stroke-width:2px
style C fill:#7B68EE,stroke:#333,stroke-width:2px
style D fill:#E85D75,stroke:#333,stroke-width:3px
style E fill:#7B68EE,stroke:#333,stroke-width:2px
style F fill:#7B68EE,stroke:#333,stroke-width:2px
style G fill:#7B68EE,stroke:#333,stroke-width:2px
style H fill:#7B68EE,stroke:#333,stroke-width:2px
style I fill:#7B68EE,stroke:#333,stroke-width:2px
style K fill:#7B68EE,stroke:#333,stroke-width:2px
style L fill:#E85D75,stroke:#333,stroke-width:2px
style N fill:#50C878,stroke:#333,stroke-width:2px
What each step does:
Step 1: Prepare Data
Transform raw text into training format (JSONL with prompt/response pairs). Quality matters more than quantity here.
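As a concrete illustration, here is a minimal Python sketch that writes prompt/response pairs as JSONL (the field names are illustrative; later lessons show the exact schema the training scripts expect):

import json

examples = [
    {"prompt": "Explain overfitting in one sentence.",
     "response": "Overfitting is when a model memorizes training data instead of learning patterns that generalize."},
    {"prompt": "What is a learning rate?",
     "response": "The learning rate controls how much each weight changes per optimizer step."},
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")  # one JSON object per line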
Step 2: Initialize Model
Either load pre-trained weights (fine-tuning) or start from random numbers (training from scratch). Most of the time, you'll fine-tune.
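The two paths look like this in code. This is a minimal sketch using the Hugging Face transformers library (not the tt-train API, which later lessons cover), with TinyLlama as the example base model:

from transformers import AutoConfig, AutoModelForCausalLM

# Fine-tuning: start from pre-trained weights
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# Training from scratch: same architecture, randomly initialized weights
config = AutoConfig.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
scratch_model = AutoModelForCausalLM.from_config(config)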
Step 3: Training Loop (The Core)
This is where learning happens:
- Get Batch - Load 8-32 examples from your dataset
- Forward Pass - Model makes predictions based on current weights
- Compute Loss - Measure how far predictions are from correct answers
- Backward Pass - Calculate which direction to adjust each weight
- Update Weights - Actually change the model's parameters
- Repeat - Do this thousands of times
Think of loss as: A score that goes down as the model gets better. Loss of 2.5 → 1.2 → 0.5 means it's learning.
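Here is a minimal, runnable PyTorch-style sketch of those five steps, using a toy model and random data; tt-train's API differs in detail, but the loop structure is the same:

import torch

model = torch.nn.Linear(16, 4)                      # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(3):
    for step in range(10):
        inputs = torch.randn(8, 16)                 # 1. Get batch (batch size 8)
        targets = torch.randint(0, 4, (8,))
        logits = model(inputs)                      # 2. Forward pass
        loss = loss_fn(logits, targets)             # 3. Compute loss: how wrong?
        loss.backward()                             # 4. Backward pass: gradients
        optimizer.step()                            # 5. Update weights
        optimizer.zero_grad()                       # reset gradients for next step
    print(f"epoch {epoch}: loss {loss.item():.3f}")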
Step 4: Evaluation
Generate sample outputs to see if the model is improving. This happens every few hundred steps, not every step.
Step 5: Save Checkpoint
Store model weights and training state so you can resume if interrupted or pick the best version later.
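Continuing the PyTorch-style sketch from the training loop above, a checkpoint typically bundles model and optimizer state (the dictionary keys and file name here are illustrative):

import torch

checkpoint = {
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "epoch": epoch,
}
torch.save(checkpoint, "checkpoint.pt")

# Resuming later
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model_state"])
optimizer.load_state_dict(checkpoint["optimizer_state"])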
Step 6: Deployment
Once training is complete, use your fine-tuned model for inference. Integrate with vLLM (from Lesson 7: vLLM Production) for production serving.
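As a preview of Lesson 7, loading a fine-tuned model with vLLM's offline Python API looks roughly like this (the model path is hypothetical; TT-specific serving setup is covered in that lesson):

from vllm import LLM, SamplingParams

llm = LLM(model="./my-finetuned-model")        # path to your saved model
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain overfitting in one sentence."], params)
print(outputs[0].outputs[0].text)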
Hardware Considerations
N150 (Single Wormhole Chip)
- Perfect for: Fine-tuning small models (1-3B params)
- Batch size: 4-8 (conservative for DRAM)
- Training time: 1-3 hours typical
- What you'll learn: Core concepts, single-device patterns
N300 (Dual Wormhole Chips)
- Perfect for: Larger models, faster training
- Batch size: 16-32 (distributed across chips)
- Training time: 30-60 minutes (2x faster than N150)
- What you'll learn: DDP patterns, multi-device coordination
T3K / Blackhole / Galaxy (Advanced)
- Perfect for: Large-scale training, experimentation
- Batch size: 32+ (highly parallel)
- Training time: Minutes for small jobs
- What you'll learn: Scaling strategies, tensor parallelism
For this series: We'll focus on N150 (everyone can follow) with N300 examples for scaling.
Training Examples Throughout This Series
This series uses concrete examples to teach transferable principles:
CT-4 (Fine-tuning Basics):
- Train NanoGPT on Shakespeare text
- See hierarchical learning in action (structure → vocabulary → fluency)
- Demonstrates character-level language modeling
CT-7 and CT-8 (Architecture & Training from Scratch):
- Build a tiny transformer (10-20M parameters)
- Understand every component of the model
- Learn to design custom architectures
Why these examples:
- Clear learning progression (simple → complex)
- Visual results (you can see the model learning)
- Transferable to any domain
- Work on all hardware (N150 through Galaxy)
The goal: Learn principles you can apply to your custom models and domains.
What You'll Build (Series Overview)
Lessons CT-2 and CT-3: Preparation
- Create training datasets (JSONL format)
- Write configuration files (YAML)
- Understand the pieces before assembly
Lesson CT-4: Your First Training Run
- Train NanoGPT on Shakespeare dataset
- See progressive learning stages
- Monitor training progress
- Outcome: Understanding how models learn, working trained model
Lessons CT-5 and CT-6: Scaling Up
- Train on multiple devices (DDP)
- Track experiments with WandB
- Understand performance optimization
Lessons CT-7 and CT-8: Advanced Topics
- Understand transformer architecture
- Train a tiny model from scratch
- See the full picture (10M → 1B+ params)
Common Questions
"Should I fine-tune or train from scratch?"
99% of the time: fine-tune.
Fine-tuning is:
- Faster - Hours vs days/weeks
- Cheaper - Less compute required
- Better - Pre-trained models already understand language
- Easier - Fewer hyperparameters to tune
Train from scratch when:
- You're researching new architectures
- You need complete control
- You want to understand the fundamentals
- You're building something truly novel
"How much data do I need?"
For fine-tuning:
- 50-200 examples: Decent results for specific tasks
- 1,000-10,000 examples: Strong performance
- 100,000+ examples: Approaching pre-training scale
For training from scratch:
- Millions of examples for production models
- But 10,000+ examples can teach a tiny model (CT-8)
Quality > Quantity: 200 high-quality examples beat 10,000 mediocre ones.
"Will fine-tuning erase what the model learned?"
No, if done correctly.
- Use a low learning rate (1e-4 to 1e-5)
- Don't over-train (watch validation loss)
- The model retains general knowledge while learning your task
Think of it as: Teaching a PhD new skills, not wiping their memory.
"Can I use this for commercial projects?"
Yes, with caveats:
- TinyLlama: Apache 2.0 license (commercial-friendly)
- Your fine-tuned model: You own it
- Training code: Check tt-metal and tt-train licenses
- Hosting: Use tt-inference-server or vLLM (Lesson 7)
Always verify licenses for your specific use case.
Beyond This Lesson: The Custom AI Landscape
You're about to learn how to train custom models - but what will you build with this power? Let's explore the possibilities.
What Developers Have Built on Tenstorrent
Real-world custom models running on TT hardware:
🎯 Domain-Specific Coding Assistants
- Python → TTNN translators (convert PyTorch to TT-optimized code)
- Hardware description language generators (Verilog patterns)
- Code review bots trained on team style guides
- API documentation chatbots
📚 Knowledge Specialists
- Technical documentation assistants (trained on company wikis)
- Research paper summarizers (domain-specific scientific content)
- Legal contract analyzers (specialized terminology)
- Medical Q&A systems (trained on authorized datasets)
🎨 Creative Applications
- Genre-specific writing assistants (sci-fi, technical writing, poetry)
- Dialog generators for games or simulations
- Educational content creators (explain concepts in multiple styles)
- Multilingual translators with domain expertise
🔬 Research & Experimentation
- Novel architecture testing (new attention patterns)
- Compression experiments (how small can models go?)
- Specialized tokenizers (music notation, chemical formulas)
- Domain-specific embeddings (protein sequences, geographic data)
Working Within Constraints (N150 Can Do This!)
You don't need massive infrastructure to build something meaningful:
- Fine-tune 1-3B models in hours - TinyLlama, Qwen3-0.6B, Gemma-3-1B all work on N150
- Deploy with vLLM for production inference - low-latency serving with high request throughput
- Iterate quickly with small datasets - 100-1000 high-quality examples beat 100,000 mediocre ones
- Combine multiple specialized models - Build an ensemble of experts, each fine-tuned for specific tasks
- Scale when needed - Start on N150, move to N300 for 2x speedup, T3K for 8x, Galaxy for research scale
The magic is in the data and the task definition, not the hardware scale.
Imagine: Your Custom Model Journey
Month 1 (Starting Today):
- Learn training fundamentals on N150
- Build your first domain-specific model
- Deploy with vLLM for internal use
- Outcome: Working custom model serving real users
Month 2-3:
- Experiment with different base models (Qwen, Gemma, Llama)
- Try multi-task fine-tuning (one model, multiple skills)
- Scale to N300 for faster iteration
- Outcome: Production-ready specialized models
Month 6+:
- Train multiple specialized models for different domains
- Explore novel architectures (CT-7, CT-8)
- Contribute patterns back to tt-blacksmith
- Outcome: You're pushing the boundaries of what's possible on TT hardware
From Learning to Leading
This series teaches you:
- ✅ The techniques (fine-tuning, configuration, multi-device training)
- ✅ The tools (tt-train, tt-metal, experiment tracking)
- ✅ The patterns (tt-blacksmith best practices)
But more importantly, it empowers you to:
- 🚀 Imagine what specialized AI can do for your domain
- 🛠️ Build custom models that solve real problems
- 📈 Scale from prototype to production
- 🌟 Innovate within hardware constraints
The question isn't "Can I train a custom model on Tenstorrent hardware?"
The question is "What will I build first?"
Key Takeaways
✅ Training creates models, inference uses them
✅ Fine-tuning is usually the right choice for custom models
✅ tt-train provides the framework for training on TT hardware
✅ tt-blacksmith shows the patterns for organizing training code
✅ Start with N150, scale to N300+ when needed
✅ Focus on data quality over quantity
✅ Examples in this series teach transferable principles
Next Steps
Lesson CT-2: Dataset Fundamentals
Now that you understand the concepts, it's time to get hands-on. In the next lesson, you'll:
- Create your first training dataset (JSONL format)
- Validate dataset format
- Understand tokenization and batching
- See how data flows through training
Estimated time: 15 minutes | Prerequisites: This lesson (CT-1)
Additional Resources
Official Documentation
- tt-metal GitHub - Core SDK
- tt-train Documentation - Training framework
- tt-blacksmith Examples - Framework patterns
Related Lessons
- Lesson 7: vLLM Production (inference with fine-tuned models)
- Lesson 11: TT-Forge (tt-forge-fe, experimental compiler)
- Lesson 12: TT-XLA JAX (alternative training framework)
Community
- Tenstorrent Discord - Ask questions, share results
- GitHub Discussions - Technical discussions
Ready to build your first dataset? Continue to Lesson CT-2: Dataset Fundamentals →