
Multi-Device Training

Scale your training to multiple Tenstorrent chips using Distributed Data Parallel (DDP). Learn to train faster while maintaining model quality.

What You'll Learn

Time: 15 minutes | Prerequisites: CT-4 (Fine-tuning Basics)


Why Multi-Device Training?

Single Device (N150) Limitations

Multi-Device (N300+) Benefits

Key insight: With proper configuration, multi-device training produces identical results to single-device, just faster.


Distributed Data Parallel (DDP) Explained

How DDP Works

Distributed Data Parallel training splits each batch across multiple devices, runs the forward and backward passes in parallel, then synchronizes gradients. Here's the visual flow:

graph TD
    A[Batch: 16 samples] --> B[Split Batch]

    B --> C[Device 0<br/>8 samples]
    B --> D[Device 1<br/>8 samples]

    C --> E[Forward Pass<br/>Device 0]
    D --> F[Forward Pass<br/>Device 1]

    E --> G[Compute Loss 0]
    F --> H[Compute Loss 1]

    G --> I[Backward Pass<br/>Gradients 0]
    H --> J[Backward Pass<br/>Gradients 1]

    I --> K[All-Reduce<br/>Average Gradients]
    J --> K

    K --> L[Device 0<br/>Update Weights]
    K --> M[Device 1<br/>Update Weights]

    L --> N[Weights Synchronized<br/>Both devices identical]
    M --> N

    style A fill:#4A90E2,stroke:#333,stroke-width:2px
    style B fill:#7B68EE,stroke:#333,stroke-width:2px
    style C fill:#7B68EE,stroke:#333,stroke-width:2px
    style D fill:#7B68EE,stroke:#333,stroke-width:2px
    style K fill:#E85D75,stroke:#333,stroke-width:3px
    style N fill:#50C878,stroke:#333,stroke-width:2px

Single vs Multi-Device comparison:

| Step     | Single Device (N150)     | Multi-Device DDP (N300)          |
|----------|--------------------------|----------------------------------|
| Input    | Batch of 8               | Batch of 16 (split 8+8)          |
| Forward  | Device 0 processes all   | Both devices in parallel         |
| Backward | Calculate gradients      | Calculate gradients in parallel  |
| Sync     | No sync needed           | All-reduce averages gradients    |
| Update   | Update weights           | Both devices update identically  |
| Time     | 1.0x                     | ~0.5x (2x faster)                |

Key insight: The all-reduce synchronization is the "magic" that keeps devices in sync while processing different data.
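
To see that "magic" in miniature, here is a NumPy sketch of one DDP step on a toy linear model, simulating two devices in a single process (purely illustrative; forward_backward is a stand-in for your real model):

import numpy as np

def forward_backward(w, x, y):
    """One device's step on its shard: MSE loss and its gradient."""
    err = x @ w - y
    return np.mean(err ** 2), 2 * x.T @ err / len(x)

rng = np.random.default_rng(0)
w = rng.normal(size=4)                          # weights start identical on both devices
x, y = rng.normal(size=(16, 4)), rng.normal(size=16)

# Split the batch of 16 into 8 + 8 and compute each device's gradient on its shard
grads = [forward_backward(w, x[s], y[s])[1] for s in np.split(np.arange(16), 2)]

avg_grad = np.mean(grads, axis=0)               # all-reduce: average across devices

# With equal shards, the averaged gradient equals the full-batch gradient exactly
assert np.allclose(avg_grad, forward_backward(w, x, y)[1])

w -= 0.01 * avg_grad                            # both devices apply the identical update

Because both devices start from the same weights and apply the same averaged gradient, they stay in sync without ever exchanging weights.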

Key points:

When to Use DDP

Use DDP when:

Skip DDP when:


Configuration Changes for DDP

N150 (Single Device) - Baseline

training_config:
  batch_size: 8
  gradient_accumulation_steps: 4
  # Effective batch: 8 × 4 = 32

device_config:
  enable_ddp: False
  mesh_shape: [1, 1]               # 1 device

N300 (Dual Chips) - DDP Enabled

training_config:
  batch_size: 16                   # 2x larger (split across devices)
  gradient_accumulation_steps: 2   # Reduced (same effective batch)
  # Effective batch: 16 × 2 = 32 (same as N150!)

device_config:
  enable_ddp: True                 # Enable DDP
  mesh_shape: [1, 2]               # 1 row × 2 columns = 2 devices

What changed:

- batch_size: 8 → 16 (doubled, then split 8 per device)
- gradient_accumulation_steps: 4 → 2 (halved, keeping the effective batch at 32)
- enable_ddp: False → True
- mesh_shape: [1, 1] → [1, 2]

Key principle: Keep batch_size × gradient_accumulation_steps constant for fair comparison.
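
A quick sanity check in Python makes this principle executable (a minimal sketch; here batch_size is the global batch from the YAML above, already split across devices):

def effective_batch(batch_size: int, grad_accum_steps: int) -> int:
    """Global effective batch: batch_size already counts all devices."""
    return batch_size * grad_accum_steps

n150 = effective_batch(batch_size=8, grad_accum_steps=4)
n300 = effective_batch(batch_size=16, grad_accum_steps=2)
assert n150 == n300 == 32, "configs are not comparable!"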


Training on N300 with DDP

Step 1: Verify Hardware

Check that both chips are detected:

tt-smi

Expected output:

Device 0: Wormhole (N300)
Device 1: Wormhole (N300)

Step 2: Launch Training

To start multi-device training:

cd ~/tt-scratchpad/training
python train.py --config configs/training_n300.yaml

What this does:

  1. Loads configs/training_n300.yaml (with DDP configuration)
  2. Initializes both devices in the mesh
  3. Launches training with DDP enabled across all devices

Step 3: Monitor DDP Training

Initial setup:

🎯 Custom Training
============================================================

Loading config: configs/training_n300.yaml
Initializing 2 devices...                    # ← DDP initialization
Device mesh: [1, 2]                          # ← 2 devices configured
Creating model...
Loading weights from ~/models/tinyllama_safetensors
Loaded 50 examples from my_dataset.jsonl

Training configuration:
  Devices: 2                                 # ← DDP active
  Batch size: 16 (per-device: 8)             # ← Split across devices
  Gradient accumulation: 2
  Effective batch size: 32

Training progress:

Training:  20%|████▌                   | 100/500 [00:08<00:32, 3.1 it/s, loss=2.12]

Notice: 3.1 it/s (iterations per second) should be ~2x higher than N150.


Performance Comparison

Expected Speedup

| Hardware | Devices | Batch Size | Training Time | Speedup       |
|----------|---------|------------|---------------|---------------|
| N150     | 1       | 8          | 1.5-3 hours   | 1x (baseline) |
| N300     | 2       | 16         | 45-90 min     | ~2x           |
| T3K      | 8       | 64         | 15-30 min     | ~6-8x         |

Why not perfect linear scaling? Each step pays for the gradient all-reduce, and fixed per-step costs (data loading, host-side dispatch) don't shrink as devices are added.

Real-world: Expect 1.8-2.0x speedup on N300, 6-7x on T3K.
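
As a back-of-the-envelope model (illustrative, not a measurement), Amdahl's law captures why: if a fraction of each step is serial overhead (all-reduce, host work) that doesn't shrink with more devices, the speedup is bounded:

def amdahl_speedup(num_devices: int, serial_fraction: float) -> float:
    """Upper bound on speedup when serial_fraction of step time can't parallelize."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / num_devices)

print(amdahl_speedup(2, 0.05))   # ~1.90x, consistent with 1.8-2.0x on N300
print(amdahl_speedup(8, 0.05))   # ~5.93x, consistent with 6-7x on T3K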


Advanced: T3K and Galaxy

T3K Configuration (8 Devices)

training_config:
  batch_size: 64                   # 8x larger
  gradient_accumulation_steps: 1   # No accumulation needed
  # Effective batch: 64 × 1 = 64

device_config:
  enable_ddp: True
  mesh_shape: [2, 4]               # 2 rows × 4 columns = 8 devices
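
A tiny helper makes the relationship between mesh shape, device count, and per-device batch explicit (an illustrative sketch, not part of the training scripts):

def mesh_devices(rows: int, cols: int) -> int:
    """Total devices in a [rows, columns] mesh."""
    return rows * cols

def per_device_batch(global_batch: int, rows: int, cols: int) -> int:
    devices = mesh_devices(rows, cols)
    assert global_batch % devices == 0, "global batch must split evenly across devices"
    return global_batch // devices

print(mesh_devices(2, 4))           # 8 devices on T3K
print(per_device_batch(64, 2, 4))   # 8 samples per device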

Device Mesh Visualization:

graph TD
    subgraph N150["N150 (Single Chip)"]
        A1[Device 0]
    end

    subgraph N300["N300 (Dual Chip)"]
        B1[Device 0] --- B2[Device 1]
    end

    subgraph T3K["T3K (8 Chips, 2x4 Mesh)"]
        C1[Dev 0] --- C2[Dev 1] --- C3[Dev 2] --- C4[Dev 3]
        C5[Dev 4] --- C6[Dev 5] --- C7[Dev 6] --- C8[Dev 7]
        C1 --- C5
        C2 --- C6
        C3 --- C7
        C4 --- C8
    end

    subgraph Galaxy["Galaxy (32+ Chips)"]
        D1[4x8 mesh = 32 chips]
    end

    style A1 fill:#4A90E2,stroke:#333,stroke-width:2px
    style B1 fill:#7B68EE,stroke:#333,stroke-width:2px
    style B2 fill:#7B68EE,stroke:#333,stroke-width:2px
    style C1 fill:#50C878,stroke:#333,stroke-width:1px
    style C2 fill:#50C878,stroke:#333,stroke-width:1px
    style C3 fill:#50C878,stroke:#333,stroke-width:1px
    style C4 fill:#50C878,stroke:#333,stroke-width:1px
    style C5 fill:#50C878,stroke:#333,stroke-width:1px
    style C6 fill:#50C878,stroke:#333,stroke-width:1px
    style C7 fill:#50C878,stroke:#333,stroke-width:1px
    style C8 fill:#50C878,stroke:#333,stroke-width:1px
    style D1 fill:#E85D75,stroke:#333,stroke-width:2px

Mesh shape explained:

- mesh_shape is [rows, columns]; total devices = rows × columns
- [1, 1] = 1 device (N150), [1, 2] = 2 (N300), [2, 4] = 8 (T3K), [4, 8] = 32 (Galaxy)

Trade-offs: a larger mesh means a larger global batch, so you take fewer optimizer steps per epoch and usually need to retune the learning rate.

LR scaling rule: If you scale batch size by N, consider scaling LR by √N.

Example: Batch 32 → 64 (2x), try LR 1e-4 → 1.4e-4 (√2 ≈ 1.4x)
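
In code, the square-root rule is one line (a heuristic sketch; always validate the new LR empirically):

import math

def scale_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Square-root LR scaling: multiply LR by sqrt of the batch growth factor."""
    return base_lr * math.sqrt(new_batch / base_batch)

print(scale_lr(1e-4, 32, 64))   # ~1.41e-4, matching the example above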

Galaxy Configuration (32+ Devices)

device_config:
  enable_ddp: True
  mesh_shape: [4, 8]               # 32 devices (4 rows × 8 columns)

Use cases:

Note: Galaxy-scale training requires careful hyperparameter tuning and is beyond the scope of this intro lesson.


Troubleshooting Multi-Device Issues

Issue 1: DDP Initialization Fails

Symptoms:

RuntimeError: Failed to initialize DDP
Device 1 not found

Fixes:

  1. Check tt-smi - are all devices detected?
  2. Restart devices: tt-smi -r all
  3. Check mesh_shape matches available devices
  4. Verify no other processes using devices

Issue 2: Gradients Not Synchronizing

Symptoms:

Fixes:

  1. Verify enable_ddp: True in config
  2. Check gradient synchronization logs
  3. Ensure all devices running same code version
  4. Profile with ttnn.profiler

Issue 3: Performance Not Scaling

Symptoms:

Possible causes:

Fixes:

  1. Increase batch size to utilize devices fully
  2. Profile communication overhead
  3. Check device memory utilization
  4. Adjust gradient accumulation

Issue 4: OOM with Larger Batch

Symptoms:

RuntimeError: Device out of memory

Fixes:

  1. Reduce batch_size (try 12 instead of 16)
  2. Increase gradient_accumulation_steps
  3. Check that batch is properly split across devices
  4. Verify device memory with tt-smi -m

DDP Best Practices

1. Keep Effective Batch Constant

When scaling devices, adjust the per-device batch and gradient_accumulation_steps to maintain:

effective_batch = per_device_batch × gradient_accumulation_steps × num_devices

Note: in the YAML configs above, batch_size is the global batch (already split across devices), so per_device_batch = batch_size ÷ num_devices.

Example:

N150: 8 × 4 × 1 = 32
N300: 16 × 2 × 2 = 64  # Oops, doubled the effective batch!

Better N300: 8 × 2 × 2 = 32  # Same effective batch as N150
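
The same check, written with the per-device convention used in the formula above (illustrative):

def effective_batch(per_device_batch: int, grad_accum: int, num_devices: int) -> int:
    return per_device_batch * grad_accum * num_devices

assert effective_batch(8, 4, 1) == 32    # N150 baseline
assert effective_batch(16, 2, 2) == 64   # naive N300: effective batch doubled
assert effective_batch(8, 2, 2) == 32    # corrected N300: matches the baseline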

2. Validate Results Match

After DDP training, verify that:

If results differ significantly:

3. Monitor Per-Device Metrics

Use logging to track:

Tools:

4. Start Small, Scale Up

Recommended progression:

  1. Debug on N150 (single device)
  2. Validate on N300 (2 devices)
  3. Scale to T3K (8 devices) when ready
  4. Consider Galaxy for production

Why: Easier to debug on fewer devices, then scale with confidence.


Gradient Synchronization Deep Dive

What Gets Synchronized?

After each backward pass:

  1. Each device computes local gradients
  2. All-reduce operation averages gradients across devices
  3. Each device gets the averaged gradient
  4. Optimizer updates weights using averaged gradient

Communication Patterns

Ring All-Reduce (efficient for large models):

Device 0 ←→ Device 1 ←→ ... ←→ Device N ←→ Device 0 (the ring wraps around)

Why it matters: each device talks only to its ring neighbors, and the data each device sends per all-reduce stays roughly constant as the device count grows, so synchronization cost scales gracefully from N300 to T3K.
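
For intuition, here is a pure-Python simulation of ring all-reduce (a toy sketch: real implementations work on large tensor chunks and overlap communication with compute):

import numpy as np

def ring_all_reduce_mean(per_device: list) -> list:
    """Average one gradient vector across n simulated devices via the ring algorithm."""
    n = len(per_device)
    # Each device splits its local gradient into n chunks
    chunks = [list(np.array_split(np.asarray(g, dtype=float), n)) for g in per_device]

    # Reduce-scatter: after n-1 hops each device owns the full sum of one chunk
    for t in range(n - 1):
        for d in range(n):
            c = (d - t - 1) % n
            chunks[d][c] = chunks[d][c] + chunks[(d - 1) % n][c]

    # All-gather: circulate the summed chunks until every device has all of them
    for t in range(n - 1):
        for d in range(n):
            c = (d - t) % n
            chunks[d][c] = chunks[(d - 1) % n][c].copy()

    return [np.concatenate(ch) / n for ch in chunks]

grads = [np.random.default_rng(i).normal(size=8) for i in range(4)]
assert all(np.allclose(r, np.mean(grads, axis=0)) for r in ring_all_reduce_mean(grads))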

Profiling Communication

# In training script (advanced); loss and optim come from your training loop
import ttnn

with ttnn.profile() as prof:
    # One training step: backward pass + optimizer update
    loss.backward()
    optim.step()

# Analyze communication vs compute time
print(prof.summary())

Ideal ratio: Communication < 10% of total time.


Scaling Your Ambitions: From Prototype to Production

You've learned the mechanics of multi-device training. But what does scaling really enable? Let's explore how multi-device training transforms what you can build.

The Scaling Journey

Week 1: Prototype on N150

Week 2: Iterate on N300

Month 2: Scale on T3K

Production: Deploy with confidence

Real-World Scaling Success Stories

🚀 "Code Review Bot" (Startup → Enterprise)

💼 "Legal Document Generator" (Consulting → SaaS)

🎮 "Game NPC Dialogue" (Indie → AAA)

🏥 "Medical Report Assistant" (Research → Clinical)

What Multi-Device Training Really Gives You

It's not just about speed. It's about:

Experimentation velocity

🎯 Dataset scale

🚀 Model complexity

💰 Economic viability

Your Multi-Device Roadmap

Month 1 (N150 - Learning):

Month 2 (N300 - Optimizing):

Month 3+ (T3K - Scaling):

Production (Right-sized hardware):

The Power Law of Training Scale

Here's what most developers don't realize:

Why the multiplier effect?

It's not linear. It's exponential.

From Learning to Leading

You now understand:

The question isn't "Should I scale to multi-device?"

The question is "How fast do I want to iterate and learn?"

Start where you are. Scale when you're ready. The path is clear.


Key Takeaways

DDP scales training to multiple devices efficiently

N300 provides ~2x speedup over N150

Keep effective batch size constant for fair comparison

Gradient synchronization ensures all devices stay in sync

Start with single device, scale up after validation

Monitor per-device metrics to catch issues early


Next Steps

Lesson CT-6: Experiment Tracking

You've learned to train on single and multiple devices. Next, learn to track and compare experiments:

  1. WandB integration for experiment tracking
  2. Compare hyperparameter variations
  3. Visualize training curves
  4. Share results with team

Estimated time: 10-15 minutes | Prerequisites: CT-4, CT-5

Or skip to:

Lesson CT-7: Model Architecture Basics

Understand transformer components before training from scratch.


Additional Resources

Documentation

Configuration Examples

Profiling Tools


Ready to track your experiments? Continue to Lesson CT-6: Experiment Tracking