Multi-Device Training
Scale your training to multiple Tenstorrent chips using Distributed Data Parallel (DDP) patterns. Learn to train faster while maintaining result quality.
What You'll Learn
- Distributed Data Parallel (DDP) training fundamentals
- Scaling from N150 to N300, T3K, and beyond
- Device mesh configuration
- Performance optimization
- Multi-device debugging
Time: 15 minutes | Prerequisites: CT-4 (Fine-tuning Basics)
Why Multi-Device Training?
Single Device (N150) Limitations
- ✅ Simple, easy to debug
- ⚠️ Slower training (1-3 hours)
- ⚠️ Smaller batch sizes (memory-limited)
Multi-Device (N300+) Benefits
- ✅ ~2x faster on N300 (30-60 minutes)
- ✅ ~8x faster on T3K (8 chips)
- ✅ Larger effective batch sizes
- ✅ Better hardware utilization
- ⚠️ Slightly more complex setup
Key insight: With proper configuration, multi-device training produces identical results to single-device, just faster.
Distributed Data Parallel (DDP) Explained
How DDP Works
Data-parallel training splits each batch across multiple devices, runs the forward and backward passes in parallel, then synchronizes gradients. Here's the visual flow:
graph TD
A[Batch: 16 samples] --> B[Split Batch]
B --> C[Device 0<br/>8 samples]
B --> D[Device 1<br/>8 samples]
C --> E[Forward Pass<br/>Device 0]
D --> F[Forward Pass<br/>Device 1]
E --> G[Compute Loss 0]
F --> H[Compute Loss 1]
G --> I[Backward Pass<br/>Gradients 0]
H --> J[Backward Pass<br/>Gradients 1]
I --> K[All-Reduce<br/>Average Gradients]
J --> K
K --> L[Device 0<br/>Update Weights]
K --> M[Device 1<br/>Update Weights]
L --> N[Weights Synchronized<br/>Both devices identical]
M --> N
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style B fill:#7B68EE,stroke:#333,stroke-width:2px
style C fill:#7B68EE,stroke:#333,stroke-width:2px
style D fill:#7B68EE,stroke:#333,stroke-width:2px
style K fill:#E85D75,stroke:#333,stroke-width:3px
style N fill:#50C878,stroke:#333,stroke-width:2px
Single vs Multi-Device comparison:
| Step | Single Device (N150) | Multi-Device DDP (N300) |
|---|---|---|
| Input | Batch of 8 | Batch of 16 (split 8+8) |
| Forward | Device 0 processes all | Both devices in parallel |
| Backward | Calculate gradients | Calculate gradients in parallel |
| Sync | No sync needed | All-reduce averages gradients |
| Update | Update weights | Both devices update identically |
| Time | 1.0x | ~0.5x (2x faster) |
Key insight: The all-reduce synchronization is the "magic" that keeps devices in sync while processing different data.
Key points:
- Each device processes a portion of the batch
- Gradients are averaged across devices (all-reduce operation)
- All devices stay in sync (identical weights after update)
- Training is parallelized (faster throughput)
- Results match single-device training (if configured correctly)
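To make these steps concrete, here is a minimal NumPy sketch of one DDP step on two simulated devices. The linear model, data, and learning rate are made up purely for illustration; tt-train performs the real version of this for you.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)
weights = [w.copy(), w.copy()]        # identical weight copies on device 0 and device 1
batch = rng.normal(size=(16, 4))      # global batch of 16 samples
targets = rng.normal(size=16)

shards = np.split(batch, 2)           # split 8 + 8 across the two devices
target_shards = np.split(targets, 2)

local_grads = []
for dev in range(2):                  # each "device" works only on its own shard
    preds = shards[dev] @ weights[dev]
    error = preds - target_shards[dev]
    local_grads.append(2 * shards[dev].T @ error / len(error))  # MSE gradient, linear model

avg_grad = sum(local_grads) / len(local_grads)    # all-reduce: average gradients across devices

lr = 0.01
weights = [wd - lr * avg_grad for wd in weights]  # both devices apply the same averaged gradient
assert np.allclose(weights[0], weights[1])        # weights stay synchronized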
When to Use DDP
Use DDP when:
- ✅ You have N300 (2 chips) or T3K (8 chips)
- ✅ You want faster iteration
- ✅ Your model fits on one device (we're not doing model parallelism)
Skip DDP when:
- ⚠️ You only have N150 (single chip)
- ⚠️ Debugging training issues (simpler to debug on 1 device)
- ⚠️ Very small datasets (overhead not worth it)
Configuration Changes for DDP
N150 (Single Device) - Baseline
training_config:
batch_size: 8
gradient_accumulation_steps: 4
# Effective batch: 8 × 4 = 32
device_config:
enable_ddp: False
mesh_shape: [1, 1] # 1 device
N300 (Dual Chips) - DDP Enabled
training_config:
batch_size: 16 # 2x larger (split across devices)
gradient_accumulation_steps: 2 # Reduced (same effective batch)
# Effective batch: 16 × 2 = 32 (same as N150!)
device_config:
enable_ddp: True # Enable DDP
mesh_shape: [1, 2] # 1 row × 2 columns = 2 devices
What changed:
- batch_size doubled (16 instead of 8)
- gradient_accumulation_steps halved (2 instead of 4)
- enable_ddp: True
- mesh_shape: [1, 2] (two devices)
Key principle: Keep batch_size × gradient_accumulation_steps constant for fair comparison.
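A quick sanity check before launching (the values are copied from the two configs above):
# Global batch × gradient accumulation should match across configs
n150 = {"batch_size": 8,  "gradient_accumulation_steps": 4}
n300 = {"batch_size": 16, "gradient_accumulation_steps": 2}

effective = lambda cfg: cfg["batch_size"] * cfg["gradient_accumulation_steps"]
assert effective(n150) == effective(n300) == 32, "Effective batch sizes differ"
print("Effective batch:", effective(n300))  # 32 in both cases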
Training on N300 with DDP
Step 1: Verify Hardware
Check that both chips are detected:
tt-smi
Expected output:
Device 0: Wormhole (N300)
Device 1: Wormhole (N300)
Step 2: Launch Training
To start multi-device training:
cd ~/tt-scratchpad/training
python train.py --config configs/training_n300.yaml
What this does:
- Loads configs/training_n300.yaml (with DDP configuration)
- Initializes both devices in the mesh
- Launches training with DDP enabled across all devices
Step 3: Monitor DDP Training
Initial setup:
🎯 Custom Training
============================================================
Loading config: configs/training_n300.yaml
Initializing 2 devices... # ← DDP initialization
Device mesh: [1, 2] # ← 2 devices configured
Creating model...
Loading weights from ~/models/tinyllama_safetensors
Loaded 50 examples from my_dataset.jsonl
Training configuration:
Devices: 2 # ← DDP active
Batch size: 16 (per-device: 8) # ← Split across devices
Gradient accumulation: 2
Effective batch size: 32
Training progress:
Training: 20%|████▌ | 100/500 [00:08<00:32, 3.1 it/s, loss=2.12]
Notice: the throughput of 3.1 it/s (iterations per second) should be roughly 2x what you saw on N150.
Performance Comparison
Expected Speedup
| Hardware | Devices | Batch Size | Training Time | Speedup |
|---|---|---|---|---|
| N150 | 1 | 8 | 1.5-3 hours | 1x (baseline) |
| N300 | 2 | 16 | 45-90 min | ~2x |
| T3K | 8 | 64 | 15-30 min | ~6-8x |
Why not perfect linear scaling?
- Communication overhead (gradient synchronization)
- Batch size scaling (larger batches → fewer steps → less benefit)
- Hardware utilization (not all operations parallelize perfectly)
Real-world: Expect 1.8-2.0x speedup on N300, 6-7x on T3K.
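To put a number on your own runs, a tiny helper like this does the arithmetic; the timings below are placeholders, not measurements:
# Scaling efficiency = actual speedup / ideal (linear) speedup
def scaling_efficiency(single_device_minutes, multi_device_minutes, num_devices):
    speedup = single_device_minutes / multi_device_minutes
    return speedup, speedup / num_devices

# Hypothetical measurements: 120 min on N150, 65 min on N300, 18 min on T3K
print(scaling_efficiency(120, 65, 2))   # (~1.85x speedup, ~92% efficiency)
print(scaling_efficiency(120, 18, 8))   # (~6.7x speedup, ~83% efficiency)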
Advanced: T3K and Galaxy
T3K Configuration (8 Devices)
training_config:
batch_size: 64 # 8x larger
gradient_accumulation_steps: 1 # No accumulation needed
# Effective batch: 64 × 1 = 64
device_config:
enable_ddp: True
mesh_shape: [2, 4] # 2 rows × 4 columns = 8 devices
Device Mesh Visualization:
graph TD
subgraph N150["N150 (Single Chip)"]
A1[Device 0]
end
subgraph N300["N300 (Dual Chip)"]
B1[Device 0] --- B2[Device 1]
end
subgraph T3K["T3K (8 Chips, 2x4 Mesh)"]
C1[Dev 0] --- C2[Dev 1] --- C3[Dev 2] --- C4[Dev 3]
C5[Dev 4] --- C6[Dev 5] --- C7[Dev 6] --- C8[Dev 7]
C1 --- C5
C2 --- C6
C3 --- C7
C4 --- C8
end
subgraph Galaxy["Galaxy (32+ Chips)"]
D1[4x8 mesh = 32 chips]
end
style A1 fill:#4A90E2,stroke:#333,stroke-width:2px
style B1 fill:#7B68EE,stroke:#333,stroke-width:2px
style B2 fill:#7B68EE,stroke:#333,stroke-width:2px
style C1 fill:#50C878,stroke:#333,stroke-width:1px
style C2 fill:#50C878,stroke:#333,stroke-width:1px
style C3 fill:#50C878,stroke:#333,stroke-width:1px
style C4 fill:#50C878,stroke:#333,stroke-width:1px
style C5 fill:#50C878,stroke:#333,stroke-width:1px
style C6 fill:#50C878,stroke:#333,stroke-width:1px
style C7 fill:#50C878,stroke:#333,stroke-width:1px
style C8 fill:#50C878,stroke:#333,stroke-width:1px
style D1 fill:#E85D75,stroke:#333,stroke-width:2px
Mesh shape explained:
- [1, 1] = 1 row × 1 column = 1 device (N150)
- [1, 2] = 1 row × 2 columns = 2 devices (N300)
- [2, 4] = 2 rows × 4 columns = 8 devices (T3K)
- [4, 8] = 4 rows × 8 columns = 32 devices (Galaxy)
Trade-offs:
- ✅ Much faster training (~6-8x speedup)
- ⚠️ Larger effective batch (may need LR adjustment)
- ⚠️ More communication overhead
LR scaling rule: If you scale batch size by N, consider scaling LR by √N.
Example: Batch 32 → 64 (2x), try LR 1e-4 → 1.4e-4 (√2 ≈ 1.4x)
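As a helper, the square-root rule is easy to apply mechanically (the numbers mirror the example above):
import math

def scale_lr(base_lr, base_batch, new_batch):
    """Square-root LR scaling: lr_new = lr_base * sqrt(new_batch / base_batch)."""
    return base_lr * math.sqrt(new_batch / base_batch)

print(scale_lr(1e-4, 32, 64))  # ~1.41e-4, matching the √2 example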
Galaxy Configuration (32+ Devices)
device_config:
enable_ddp: True
mesh_shape: [4, 8] # 32 devices (4 rows × 8 columns)
Use cases:
- Large-scale training (billions of parameters)
- Research experiments (fast iteration)
- Production training pipelines
Note: Galaxy-scale training requires careful hyperparameter tuning and is beyond the scope of this intro lesson.
Troubleshooting Multi-Device Issues
Issue 1: DDP Initialization Fails
Symptoms:
RuntimeError: Failed to initialize DDP
Device 1 not found
Fixes:
- Check tt-smi - are all devices detected?
- Restart devices: tt-smi -r all
- Verify no other processes using devices
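A cheap pre-flight check is to confirm mesh_shape doesn't request more devices than are present. A minimal sketch, assuming you supply the detected device count yourself (for example, after counting entries in tt-smi output):
# Check a mesh_shape against the number of detected devices.
def validate_mesh(mesh_shape, detected_devices):
    rows, cols = mesh_shape
    required = rows * cols
    if required > detected_devices:
        raise RuntimeError(
            f"mesh_shape {mesh_shape} needs {required} devices, "
            f"but only {detected_devices} were detected"
        )
    return required

validate_mesh([1, 2], detected_devices=2)    # OK for N300
# validate_mesh([2, 4], detected_devices=2)  # would raise: T3K mesh on N300 hardware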
Issue 2: Gradients Not Synchronizing
Symptoms:
- Devices show different loss values
- Training diverges
- Inconsistent results
Fixes:
- Verify enable_ddp: True in config
- Check gradient synchronization logs
- Ensure all devices running same code version
- Profile with ttnn.profiler
Issue 3: Performance Not Scaling
Symptoms:
- N300 training is only 1.2x faster (not 2x)
- Low device utilization
Possible causes:
- Batch size too small (increase if memory allows)
- Communication bottleneck (check network)
- Unbalanced workload (check per-device metrics)
Fixes:
- Increase batch size to utilize devices fully
- Profile communication overhead
- Check device memory utilization
- Adjust gradient accumulation
Issue 4: OOM with Larger Batch
Symptoms:
RuntimeError: Device out of memory
Fixes:
- Reduce batch_size (try 12 instead of 16)
- Increase gradient_accumulation_steps
- Check that batch is properly split across devices
- Verify device memory with tt-smi -m
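A common fallback is to halve the global batch and double gradient accumulation, which keeps the effective batch constant. A small sketch (the helper is hypothetical; the config keys match the YAML used in this lesson):
# Halve batch_size and double accumulation so the effective batch stays constant.
def shrink_batch_on_oom(config):
    if config["batch_size"] % 2 != 0:
        raise ValueError("batch_size must be even to halve cleanly")
    config["batch_size"] //= 2
    config["gradient_accumulation_steps"] *= 2
    return config

cfg = {"batch_size": 16, "gradient_accumulation_steps": 2}   # N300 config that hit OOM
print(shrink_batch_on_oom(cfg))  # {'batch_size': 8, 'gradient_accumulation_steps': 4}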
DDP Best Practices
1. Keep Effective Batch Constant
When scaling devices, adjust batch_size and gradient_accumulation_steps to maintain:
effective_batch = per_device_batch_size × gradient_accumulation_steps × num_devices
Note: batch_size in the configs above is the global batch; the per-device batch is batch_size / num_devices.
Example:
N150: 8 × 4 × 1 = 32
N300: 16 × 2 × 2 = 64 # Oops, 16 per device doubles the effective batch!
Better N300: 8 × 2 × 2 = 32 # Same effective batch as N150
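The same bookkeeping as a small helper; remember that this formula takes the per-device batch, so divide a global batch_size by num_devices before plugging it in:
def effective_batch(per_device_batch, grad_accum_steps, num_devices):
    return per_device_batch * grad_accum_steps * num_devices

print(effective_batch(8, 4, 1))   # N150 baseline: 32
print(effective_batch(16, 2, 2))  # naive N300 scaling: 64 (doubled!)
print(effective_batch(8, 2, 2))   # corrected N300: 32, matches the baseline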
2. Validate Results Match
After DDP training, verify that:
- ✅ Final loss similar to single-device
- ✅ Model quality similar (test on same examples)
- ✅ Training curves look similar (scaled by speedup)
If results differ significantly:
- Check learning rate (may need adjustment)
- Verify gradient synchronization working
- Compare checkpoints at same effective step
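A lightweight check might look like this; the loss values are placeholders standing in for numbers read from your training logs:
# Compare single-device and DDP final losses within a relative tolerance.
def losses_match(loss_single, loss_ddp, rel_tol=0.05):
    return abs(loss_single - loss_ddp) / loss_single <= rel_tol

print(losses_match(1.42, 1.45))  # True  -> within 5%, results look consistent
print(losses_match(1.42, 1.80))  # False -> investigate LR or gradient sync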
3. Monitor Per-Device Metrics
Use logging to track:
- Per-device loss
- Memory usage per device
- Communication time vs compute time
Tools:
- tt-smi - Real-time device monitoring
- ttnn.profiler - Performance profiling
- WandB (CT-6) - Multi-run comparison
4. Start Small, Scale Up
Recommended progression:
- Debug on N150 (single device)
- Validate on N300 (2 devices)
- Scale to T3K (8 devices) when ready
- Consider Galaxy for production
Why: Easier to debug on fewer devices, then scale with confidence.
Gradient Synchronization Deep Dive
What Gets Synchronized?
After each backward pass:
- Each device computes local gradients
- All-reduce operation averages gradients across devices
- Each device gets the averaged gradient
- Optimizer updates weights using averaged gradient
Communication Patterns
Ring All-Reduce (efficient for large models):
Device 0 ←→ Device 1 ←→ ... ←→ Device N ←→ Device 0 (the ring wraps around)
Why it matters:
- Large models → more gradients → more communication
- Communication time should be < compute time
- Network bandwidth matters for multi-node setups
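To see what actually happens on the ring, here is a small NumPy simulation of ring all-reduce (reduce-scatter followed by all-gather). It is purely illustrative; the real collective is implemented by the runtime:
import numpy as np

def ring_all_reduce(per_device_grads):
    """Simulate ring all-reduce: average equal-length gradient vectors across devices."""
    n = len(per_device_grads)
    # Each device splits its gradient into n chunks
    chunks = [list(np.array_split(g.astype(np.float64), n)) for g in per_device_grads]

    # Phase 1: reduce-scatter. After n-1 steps, device r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        for r in range(n):
            idx = (r - step) % n
            chunks[(r + 1) % n][idx] = chunks[(r + 1) % n][idx] + chunks[r][idx]

    # Phase 2: all-gather. Each fully reduced chunk travels once around the ring.
    for step in range(n - 1):
        for r in range(n):
            idx = (r + 1 - step) % n
            chunks[(r + 1) % n][idx] = chunks[r][idx]

    # Average and reassemble; every device ends up with the same result
    return [np.concatenate(c) / n for c in chunks]

grads = [np.arange(8, dtype=np.float64) + 10 * d for d in range(4)]  # 4 simulated devices
reduced = ring_all_reduce(grads)
assert all(np.allclose(r, np.mean(grads, axis=0)) for r in reduced)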
Profiling Communication
# In training script (advanced)
import ttnn

with ttnn.profile() as prof:
    # Training step
    loss.backward()
    optim.step()

# Analyze communication vs compute time
print(prof.summary())
Ideal ratio: Communication < 10% of total time.
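However you obtain the timings (from the profiler output above or your own timers), the ratio check itself is simple; the numbers below are placeholders:
# Flag runs where gradient-sync communication eats too much of the step time.
def comm_fraction(comm_ms, compute_ms, threshold=0.10):
    fraction = comm_ms / (comm_ms + compute_ms)
    if fraction > threshold:
        print(f"Warning: communication is {fraction:.0%} of step time (target < {threshold:.0%})")
    return fraction

print(comm_fraction(comm_ms=3.0, compute_ms=45.0))   # ~6% - healthy
print(comm_fraction(comm_ms=15.0, compute_ms=45.0))  # 25% - investigate batch size / network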
Scaling Your Ambitions: From Prototype to Production
You've learned the mechanics of multi-device training. But what does scaling really enable? Let's explore how multi-device training transforms what you can build.
The Scaling Journey
Week 1: Prototype on N150
- Build your model concept
- Validate with 50-200 training examples
- Training time: 1-3 hours per experiment
- Goal: Prove the concept works
Week 2: Iterate on N300
- 2x faster iteration (30-90 min per experiment)
- Run 3-5 experiments per day instead of 1-2
- Test multiple hyperparameter configurations
- Goal: Find optimal configuration
Month 2: Scale on T3K
- 6-8x faster training (10-20 min per experiment)
- Train on larger datasets (1000+ examples)
- Multi-task learning (multiple skills in one model)
- Goal: Production-ready models
Production: Deploy with confidence
- Models validated on multiple hardware configurations
- Proven performance characteristics
- Scalable training pipeline
- Goal: Serve real users
Real-World Scaling Success Stories
🚀 "Code Review Bot" (Startup → Enterprise)
- N150 phase: Trained on 100 team PRs, 2-hour iterations
- N300 phase: Expanded to 500 PRs, tested 10 prompt variations in a day
- T3K phase: Full company history (5000+ PRs), multi-task (style + security + performance)
- Impact: From team tool (10 devs) → company standard (200+ devs)
- Training time: 3 hours → 90 min → 15 min per full model
💼 "Legal Document Generator" (Consulting → SaaS)
- N150 phase: 50 contract templates, proved concept
- N300 phase: 200 templates across 3 practice areas, found winning config
- T3K phase: 1000+ examples, 10 specialized models (corporate, IP, employment, etc.)
- Impact: Consultancy internal tool → multi-tenant SaaS product
- Revenue: $0 → $50k MRR from faster iteration
🎮 "Game NPC Dialogue" (Indie → AAA)
- N150 phase: Single character archetype (100 dialogue lines)
- N300 phase: 5 character types, varied personalities
- T3K phase: 50+ unique NPCs, context-aware responses
- Impact: Hand-written dialogues → AI-augmented content at scale
- Cost savings: $100k+ in writing/voice acting budget
🏥 "Medical Report Assistant" (Research → Clinical)
- N150 phase: Single specialty (dermatology), 100 report examples
- N300 phase: 3 specialties, validation by clinicians
- T3K phase: 10+ specialties, multi-lingual support
- Impact: Research project → deployed in 20+ hospitals
- Time saved: 30 min/report → 5 min/report (doctors can see more patients)
What Multi-Device Training Really Gives You
It's not just about speed. It's about:
✨ Experimentation velocity
- N150: Try 1-2 ideas per day
- N300: Try 5-10 ideas per day
- T3K: Try 20-30 ideas per day
- Result: Find winning approaches 10x faster
🎯 Dataset scale
- N150: Validate with 50-200 examples
- N300: Train on 500-1000 examples
- T3K: Handle 10,000+ examples
- Result: Better models from more data
🚀 Model complexity
- N150: Single-task models
- N300: Multi-task learning
- T3K: Ensemble of specialists
- Result: More capable, versatile models
💰 Economic viability
- Prototype on N150: Low upfront cost
- Prove value before scaling: Validate before investing
- Scale to T3K when revenue justifies: Grow hardware with business
- Result: Sustainable business model
Your Multi-Device Roadmap
Month 1 (N150 - Learning):
- Master single-device training
- Build intuition for hyperparameters
- Create baseline model
- Investment: N150 hardware, your time
Month 2 (N300 - Optimizing):
- 2x faster iteration unlocks experimentation
- Test architectural variations
- Expand dataset strategically
- Investment: N300 hardware (~2x N150 cost)
Month 3+ (T3K - Scaling):
- Production-quality models in hours
- Multiple models for different use cases
- Continuous improvement pipeline
- Investment: T3K hardware, justified by production value
Production (Right-sized hardware):
- Training pipeline optimized for your scale
- Deploy on hardware that matches your needs
- Continuous retraining as data grows
- ROI: Revenue/savings >> hardware costs
The Power Law of Training Scale
Here's what most developers don't realize:
- 1x hardware (N150) = Good for learning and prototypes
- 2x hardware (N300) = 4x more experiments (because iteration is faster, you try more)
- 8x hardware (T3K) = 30x more experiments (speed enables entirely different workflows)
Why the multiplier effect?
- Faster training → More courage to experiment
- More experiments → Better intuition
- Better intuition → Smarter choices
- Smarter choices → Faster progress
It's not linear. It's exponential.
From Learning to Leading
You now understand:
- ✅ How DDP works (gradient synchronization, device meshes)
- ✅ How to configure for different hardware (N150 → N300 → T3K → Galaxy)
- ✅ How to debug multi-device issues (synchronization, performance, memory)
- ✅ How scaling enables exponentially more experimentation
The question isn't "Should I scale to multi-device?"
The question is "How fast do I want to iterate and learn?"
- N150: Learn fundamentals, prove concepts (essential first step)
- N300: Iterate 2x faster, find winning approaches (when you're serious)
- T3K: Move at production speed, build real products (when you're committed)
- Galaxy: Research-scale innovation, push boundaries (when you're leading)
Start where you are. Scale when you're ready. The path is clear.
Key Takeaways
✅ DDP scales training to multiple devices efficiently
✅ N300 provides ~2x speedup over N150
✅ Keep effective batch size constant for fair comparison
✅ Gradient synchronization ensures all devices stay in sync
✅ Start with single device, scale up after validation
✅ Monitor per-device metrics to catch issues early
Next Steps
Lesson CT-6: Experiment Tracking
You've learned to train on single and multiple devices. Next, learn to track and compare experiments:
- WandB integration for experiment tracking
- Compare hyperparameter variations
- Visualize training curves
- Share results with team
Estimated time: 10-15 minutes | Prerequisites: CT-4, CT-5
Or skip to:
Lesson CT-7: Model Architecture Basics
Understand transformer components before training from scratch.
Additional Resources
Documentation
- DDP in PyTorch - Conceptual foundation
- tt-train DDP - TT implementation
- Efficient DDP - Research paper
Configuration Examples
- tt-train examples: Check tt-metal/tt-train/sources/examples/ for multi-device configs
- DDP patterns: Reference tt-metal documentation for device mesh configuration
Profiling Tools
- tt-smi - Device monitoring
- ttnn.profiler - Performance analysis
Ready to track your experiments? Continue to Lesson CT-6: Experiment Tracking →