Configuration Patterns
Master YAML-driven training configuration using patterns from tt-blacksmith. Learn to configure hardware, logging, checkpointing, and hyperparameters.
What You'll Learn
- YAML configuration structure (tt-blacksmith pattern)
- Training hyperparameters and their effects
- Device configuration (single vs multi-chip)
- Logging and experiment tracking
- Checkpoint management strategies
- Hardware-specific optimization
Time: 15 minutes | Prerequisites: CT-2 (Dataset Fundamentals)
Why Configuration-Driven Training?
Don't hardcode values. Use config files.
Think about cooking: Would you rather memorize every ingredient quantity, or use a recipe you can share, modify, and perfect over time? Configuration files are your training recipes.
The Power of Configuration
Reproducibility is everything. When you find a config that works, you want to be able to recreate those exact results. Same config file → same training behavior → same model quality. No hunting through code to remember what learning rate you used three weeks ago.
Experimentation becomes systematic. Want to try a higher learning rate? Change one line in your config, rerun. Compare results. Keep the winner. No code changes, no risk of breaking something else. Configuration files let you experiment fearlessly.
Sharing is effortless. Instead of writing "I used batch size 8, learning rate 0.0001, AdamW optimizer with weight decay 0.01, gradient clipping at 1.0..." just send your config file. Everything's there. Your colleague runs the exact same setup in seconds.
Version control tells the story. When you track config files in git, you see exactly what changed between runs. "Oh, this commit lowered the learning rate from 1e-4 to 5e-5 - that's when training stabilized." The history writes itself.
Documentation that never lies. Comments in code get out of sync. Config files can't lie - they are what the training run actually used. Self-documenting by necessity.
The tt-blacksmith Way
tt-blacksmith uses comprehensive YAML configs with standardized sections. Here's how they fit together:
graph TD
A[YAML Config File] --> B[training_config<br/>Core Training Settings]
A --> C[device_config<br/>Hardware Setup]
A --> D[eval_config<br/>Validation Settings]
B --> E[Hyperparameters<br/>batch_size, learning_rate, epochs]
B --> F[Checkpointing<br/>Save frequency, strategy]
B --> G[Logging<br/>WandB, file output, log level]
B --> H[Optimizer<br/>AdamW, weight decay, gradient clipping]
C --> I[Single/Multi Device<br/>enable_ddp, mesh_shape]
D --> J[Sampling Parameters<br/>temperature, top_k, top_p]
style A fill:#4A90E2,stroke:#333,stroke-width:3px
style B fill:#7B68EE,stroke:#333,stroke-width:2px
style C fill:#7B68EE,stroke:#333,stroke-width:2px
style D fill:#7B68EE,stroke:#333,stroke-width:2px
style E fill:#6C757D,stroke:#333,stroke-width:1px
style F fill:#6C757D,stroke:#333,stroke-width:1px
style G fill:#6C757D,stroke:#333,stroke-width:1px
style H fill:#6C757D,stroke:#333,stroke-width:1px
style I fill:#6C757D,stroke:#333,stroke-width:1px
style J fill:#6C757D,stroke:#333,stroke-width:1px
Why this structure?
- Logical grouping: Related settings stay together
- Easy to navigate: Find what you need quickly
- Consistent across projects: Same pattern everywhere
- Self-documenting: Structure tells you what each section controls
We'll follow this pattern throughout the Custom Training series.
Configuration File Structure
Full Example: training_n150.yaml
# Training Configuration for N150 (Single Wormhole Chip)
#
# Optimized for single-chip development hardware
# Typical training time: 1-3 hours depending on dataset size
training_config:
  model_type: "llama"
  seed: 42
  batch_size: 8 # N150: Conservative for DRAM limits
  validation_batch_size: 2
  num_epochs: 3 # Adjust based on dataset size
  max_steps: 5000 # Maximum training steps
  learning_rate: 0.0001 # Standard for fine-tuning
  weight_decay: 0.01
  use_moreh_adamw: true
  use_kahan_summation: false
  use_clip_grad_norm: true
  clip_grad_norm_max_norm: 1.0
  gradient_accumulation_steps: 4 # Effective batch: 8 * 4 = 32
  eval_every: 50 # Validate every 50 steps
  model_save_interval: 100 # Checkpoint every 100 steps
  tokenizer_type: "bpe"
  checkpoint_dir: "checkpoints"
  model_config: "model_configs/model.yaml"
  # Logging configuration (tt-blacksmith pattern)
  log_level: "INFO"
  use_wandb: false # Optional experiment tracking
  wandb_project: "my-training"
  wandb_run_name: "n150-experiment"
  # Checkpoint strategy (tt-blacksmith pattern)
  checkpoint_frequency: 100 # Save every 100 steps
  validation_frequency: 50 # Validate every 50 steps
  save_strategy: "steps" # Save based on steps (not epochs)
  # NOTE: v0.64.5+ uses constant learning_rate, no scheduler_config needed
eval_config:
  repetition_penalty: 1.0
  temperature: 0.0 # Greedy decoding for validation
  top_k: 50
  top_p: 1.0
device_config:
  enable_ddp: False # N150: Single chip, no DDP
  mesh_shape: [1, 1] # 1x1 mesh (single device)
Let's break down each section.
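Before we do, here's a quick sketch of how a config like this gets used in practice. It's a generic Python example (assuming PyYAML and the file name above), not the tt-train loader itself, but the structure maps one-to-one:

```python
# Minimal sketch: read a tt-blacksmith-style config and pull out each section.
# Assumes PyYAML is installed and the file is the training_n150.yaml shown above.
import yaml

with open("training_n150.yaml") as f:
    config = yaml.safe_load(f)

train_cfg = config["training_config"]
eval_cfg = config["eval_config"]
device_cfg = config["device_config"]

print(train_cfg["batch_size"])      # 8
print(train_cfg["learning_rate"])   # 0.0001
print(device_cfg["mesh_shape"])     # [1, 1]
```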
Section 1: Training Configuration
Core Hyperparameters
| Parameter | What It Does | Typical Values | Example (N150) |
|---|---|---|---|
| `batch_size` | Examples per training step | 4-32 | 8 (DRAM conservative) |
| `learning_rate` | How fast model learns | 1e-5 to 1e-4 | 1e-4 (fine-tuning LR) |
| `num_epochs` | Passes through full dataset | 1-10 | 3 (typical fine-tuning) |
| `max_steps` | Total training steps | 100-5000 | 500 (1-3 hours) |
| `weight_decay` | Regularization strength | 0.0-0.1 | 0.01 (mild regularization) |
Batch Size Deep Dive
Think of batch size like teaching multiple students at once versus one-on-one tutoring. Batch size is how many training examples the model sees before updating its weights.
Larger batches (16-32) are like teaching a classroom. You show 16 different examples, collect feedback from all of them, then make one consolidated update to the model. The advantage? You get more consistent, stable feedback across different perspectives. Training moves faster because each update is based on broader evidence. The downside? You need more resources - specifically, more memory to hold all those examples at once.
Smaller batches (4-8) are like tutoring individuals. You show 4 examples, update immediately. The feedback is noisier - each small batch might have quirks that don't represent the full dataset. Progress is slower because you're making more frequent, smaller updates. But here's the win: it works with limited resources.
On N150, we're memory-constrained compared to massive GPU clusters with 80GB+ VRAM. The N150's DRAM is fantastic for its purpose, but we're not running H100s here. Batch size 8 is our sweet spot - conservative enough to always work, large enough to make meaningful progress. Got a particularly small model? You might push to 16. But 8 is the safe starting point that won't exhaust your DRAM mid-training.
Here's the clever trick: effective_batch_size = batch_size × gradient_accumulation_steps
Example: 8 × 4 = 32 effective batch size
You can simulate the stability of batch size 32 while only holding 8 examples in memory at once! We'll cover gradient accumulation next.
Learning Rate Deep Dive
Think of learning rate like adjusting the steering wheel when driving. Learning rate controls how aggressively the model updates its weights after seeing each batch.
Too aggressive (1e-3, ten times too high)? You overcorrect wildly. The model's loss starts bouncing all over the place, then explodes into NaN errors. It's like jerking the steering wheel hard left, then hard right, then harder left - you're not making progress, you're just creating chaos. The model literally forgets everything it knew and becomes useless.
Too timid (1e-6, a hundred times too low)? You barely turn the wheel at all. Training becomes painfully slow. After hours of compute, the loss has barely budged. You might not converge at all - the model never learns the patterns you're trying to teach it. Progress is so incremental that you're wasting time and electricity.
Just right (1e-4 to 1e-5)? Smooth, steady improvement. The loss curve descends consistently. The model absorbs your training data without catastrophic forgetting. You make measurable progress every few steps. This is the Goldilocks zone.
Starting point: 1e-4 (0.0001). This is the sweet spot for fine-tuning pre-trained models. Nine times out of ten, it just works.
If your loss is jumpy and unstable, the model is learning too aggressively. Lower the learning rate to 5e-5 or even 1e-5. You'll see the loss curve smooth out and training stabilize.
If your loss barely moves after 50-100 steps, you're being too conservative. Bump it up to 2e-4. Give the model permission to learn faster.
Why do we use lower learning rates for fine-tuning than for training from scratch?
Because the pre-trained weights are already good! They represent millions of dollars of compute and massive datasets. We're not starting from random noise - we're starting from a model that already speaks English (or code, or whatever domain). We want to nudge those weights toward our specific task, not overwrite them with aggressive updates. Think of it like editing a draft, not rewriting from scratch.
Gradient Accumulation
This is one of the cleverest tricks in deep learning. Gradient accumulation lets you simulate a large batch size while only holding a small batch in memory.
Think of it like polling a large group before making a decision. You can't fit 32 people in your office at once, but you can interview them in groups of 8, collect all their feedback, then make one consolidated decision based on all 32 opinions. That's gradient accumulation.
Here's how it works:
You set batch_size = 8 (fits in N150 DRAM) and gradient_accumulation_steps = 4. Now:
- Step 1: Process batch of 8 examples, compute gradients, but don't update weights yet - just save the gradients
- Step 2: Process another 8 examples, add their gradients to the saved ones
- Step 3: Process another 8 examples, keep accumulating
- Step 4: Process the final 8 examples (32 total now), average all the accumulated gradients, and update the weights
Effective batch size: 8 × 4 = 32
You get the training stability of a 32-example batch while only needing memory for 8 examples at once. It's like having your cake and eating it too.
The benefits are clear: Training becomes more stable (larger effective batch smooths out noise), and you don't run out of memory. The trade-off? Slightly slower training because you're doing 4 forward passes before each backward pass. But on memory-constrained hardware like N150, this trade-off is absolutely worth it.
On N150, use gradient accumulation always. Set gradient_accumulation_steps = 4 as your default. On N300 or T3K with more memory available, you might not need it - you can just use larger actual batches. But even on big hardware, gradient accumulation remains useful when you want to push batch sizes beyond what physically fits in memory.
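If you want to see the mechanics outside of tt-train (which handles this for you when you set gradient_accumulation_steps), here's a toy PyTorch sketch of the same idea. The model and data are placeholders:

```python
import torch
from torch import nn

# Toy setup so the sketch runs end-to-end; real training uses your model and dataset.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
loss_fn = nn.MSELoss()

batch_size = 8              # what fits in memory at once
accumulation_steps = 4      # gradient_accumulation_steps from the config
                            # effective batch size = 8 * 4 = 32

optimizer.zero_grad()
for step in range(100):
    x, y = torch.randn(batch_size, 16), torch.randn(batch_size, 1)
    loss = loss_fn(model(x), y)
    (loss / accumulation_steps).backward()   # accumulate scaled gradients

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # one weight update per 4 micro-batches
        optimizer.zero_grad()
```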
Epochs vs Steps
Two ways to control how long training runs: count how many times you've seen the full dataset (epochs), or count how many training iterations you've done (steps). Both work, but they're useful in different situations.
Epochs are like saying "read the entire textbook 3 times." An epoch is one complete pass through your dataset. If you have 100 training examples and set num_epochs = 3, the model sees all 100 examples three times. The total number of training steps depends on your batch size:
num_epochs = 3
dataset_size = 100
batch_size = 8
→ steps_per_epoch = 100 / 8 = 12.5 (12 steps if the last partial batch is dropped, 13 if it's kept)
→ total_steps = 3 × 12 = 36 steps
Steps are like saying "study for 500 minutes, regardless of how many chapters that covers." With max_steps = 500, training runs for exactly 500 iterations, no matter how large or small your dataset is. You might see the full dataset dozens of times (small dataset) or only see a fraction of it (huge dataset).
Which should you use?
For small datasets (50-500 examples), use max_steps for better control. With a 50-example dataset and batch size 8, one epoch is only 6-7 steps. Setting num_epochs = 3 would give you just 18-21 steps total - barely enough to learn anything. Instead, set max_steps = 500 and let the model see those 50 examples many times over. Small datasets need repetition to extract patterns.
For large datasets (10,000+ examples), use num_epochs as your natural unit. With 10,000 examples and batch size 8, one epoch is 1,250 steps. Setting num_epochs = 3 gives you 3,750 steps - substantial training. Epochs feel more intuitive here because they correspond to meaningful milestones.
Example calculation for small dataset: 50 examples, batch size 8 → 6-7 steps per epoch → 500 max_steps ≈ 80 epochs
Don't panic at "80 epochs"! This is completely normal for small datasets. The model needs to see those patterns dozens of times to internalize them. You're not overfitting - you're learning deeply from limited data.
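If you want to check these numbers for your own dataset, a small helper like this (plain Python, drop-last rounding assumed) does the arithmetic:

```python
def training_length(dataset_size, batch_size, num_epochs=None, max_steps=None):
    """Rough step/epoch arithmetic; assumes the last partial batch is dropped."""
    steps_per_epoch = max(1, dataset_size // batch_size)
    if max_steps is not None:
        return {"steps_per_epoch": steps_per_epoch,
                "total_steps": max_steps,
                "approx_epochs": round(max_steps / steps_per_epoch, 1)}
    return {"steps_per_epoch": steps_per_epoch,
            "total_steps": num_epochs * steps_per_epoch,
            "approx_epochs": num_epochs}

print(training_length(100, 8, num_epochs=3))   # 12 steps/epoch, 36 total steps
print(training_length(50, 8, max_steps=500))   # 6 steps/epoch, ~83 epochs
```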
Section 2: Device Configuration
Single Device (N150)
device_config:
  enable_ddp: False # No distributed training
  mesh_shape: [1, 1] # 1 row × 1 column = 1 device
When to use:
- N150 (single Wormhole chip)
- Development and debugging
- Small models (1-3B parameters)
Multi-Device (N300)
device_config:
  enable_ddp: True # Distributed Data Parallel
  mesh_shape: [1, 2] # 1 row × 2 columns = 2 devices
What changes:
- Batch split across devices
- Gradients synchronized after backward pass
- ~2x faster training
When to use:
- N300 (dual Wormhole chips)
- Larger models or larger batches
- Faster iteration for experimentation
Advanced (T3K, Galaxy)
device_config:
  enable_ddp: True
  mesh_shape: [2, 4] # 2 rows × 4 columns = 8 devices
When to use:
- T3K (8 chips in mesh)
- Galaxy (32+ chips)
- Large-scale training or research
Note: Lesson CT-5 covers multi-device training in detail.
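For quick reference, here's a hypothetical Python mapping of the device presets above (the names and values mirror the YAML snippets; tt-train itself just reads these from your config file):

```python
# Hypothetical lookup table matching the device_config examples above.
DEVICE_PRESETS = {
    "n150": {"enable_ddp": False, "mesh_shape": [1, 1]},   # single chip
    "n300": {"enable_ddp": True,  "mesh_shape": [1, 2]},   # dual chip, DDP
    "t3k":  {"enable_ddp": True,  "mesh_shape": [2, 4]},   # 8-chip mesh
}

def device_config(hardware: str) -> dict:
    return DEVICE_PRESETS[hardware.lower()]

print(device_config("N300"))   # {'enable_ddp': True, 'mesh_shape': [1, 2]}
```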
Section 3: Optimizer Configuration
AdamW (Default Choice)
training_config:
  use_moreh_adamw: true # TT-optimized AdamW
  weight_decay: 0.01 # L2 regularization
  use_kahan_summation: false # Numerical stability (optional)
AdamW advantages:
- ✅ Adaptive learning rates per parameter
- ✅ Momentum (better convergence)
- ✅ Weight decay (regularization)
- ✅ Industry standard for LLMs
Alternatives:
- SGD: Simpler, sometimes better for small models
- AdamW with Kahan: Better numerical precision (slower)
Recommendation: Stick with AdamW unless you have specific reasons not to.
Gradient Clipping
training_config:
  use_clip_grad_norm: true
  clip_grad_norm_max_norm: 1.0 # Clip gradients above this norm
Why clip gradients?
Prevents exploding gradients - when gradients become huge and cause NaN errors.
When to use:
- ✅ Always (it's a safety net)
- ✅ Especially with RNNs/Transformers
- ⚠️ If training is stable, can disable for slight speedup
Typical values: 0.5 to 1.0
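In PyTorch terms, those two config flags boil down to a single call between the backward pass and the optimizer step. A minimal sketch (placeholder model and data):

```python
import torch
from torch import nn

model = nn.Linear(16, 1)     # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Equivalent of use_clip_grad_norm / clip_grad_norm_max_norm in the config:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```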
Section 4: Checkpointing Strategy
Basic Checkpointing
training_config:
  model_save_interval: 100 # Save every 100 steps
  checkpoint_dir: "checkpoints_n150"
What gets saved:
- Model weights (safetensors format)
- Optimizer state (for resuming)
- Training step number
Why checkpoint?
- ✅ Training crashes → resume from last checkpoint
- ✅ Epoch 47 was best → load that checkpoint
- ✅ Share checkpoints with collaborators
Advanced Strategy (tt-blacksmith pattern)
training_config:
  checkpoint_frequency: 100 # How often to save
  save_strategy: "steps" # "steps" or "epoch"
  validation_frequency: 50 # Validate more often than save
save_strategy options:
- "steps": Save every N steps (fine-grained control)
- "epoch": Save after each epoch (natural for large datasets)
Best practices:
- Validate more frequently than saving (catch issues early)
- Keep last 3-5 checkpoints (disk space vs safety)
- Save final model separately (easy to find)
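Here's one way a "save every N steps, keep the last K" policy can look in plain PyTorch. This is a sketch, not tt-train's checkpoint writer (which saves safetensors); the paths and file naming are illustrative:

```python
from pathlib import Path
import torch

def maybe_save_checkpoint(step, model, optimizer,
                          checkpoint_dir="checkpoints", frequency=100, keep_last=5):
    """Save every `frequency` steps; prune old files, keeping the newest `keep_last`."""
    if step == 0 or step % frequency != 0:
        return
    out_dir = Path(checkpoint_dir)
    out_dir.mkdir(exist_ok=True)
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()},
               out_dir / f"step_{step:06d}.pt")

    # Keep only the most recent checkpoints so the disk doesn't fill up.
    for old in sorted(out_dir.glob("step_*.pt"))[:-keep_last]:
        old.unlink()
```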
Section 5: Logging Configuration
Basic Logging (File-based)
training_config:
  log_level: "INFO" # INFO, DEBUG, WARNING, ERROR
  # File logging is always enabled
What gets logged:
- Training loss per step
- Validation loss
- Generated sample outputs
- Hyperparameters
Output files:
- `training.log` - All training output
- `validation.txt` - Sample generations
- `training_curves.png` - Loss visualization
Advanced: WandB Integration (Optional)
training_config:
  use_wandb: false # Enable for experiment tracking
  wandb_project: "my-training-project"
  wandb_run_name: "n150-experiment-1"
What is WandB (Weights & Biases)?
Cloud-based experiment tracking:
- 📊 Beautiful loss curves
- 🔍 Compare multiple runs
- 📝 Log hyperparameters automatically
- 🖼️ Visualize sample outputs
- 👥 Share with team
When to use:
- ✅ Multiple experiments to compare
- ✅ Collaborative projects
- ✅ Production ML workflows
- ⚠️ Requires internet, account (free tier available)
When to skip:
- Single experiment, local-only
- Offline environment
- Privacy-sensitive projects
Note: Lesson CT-6 covers experiment tracking in detail.
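If you do enable it, the WandB side is only a few lines. A sketch using the standard `wandb` Python API (the project/run names match the config above; the logged numbers are dummies, and you need an account or offline mode):

```python
import wandb

run = wandb.init(
    project="my-training",       # wandb_project from the config
    name="n150-experiment",      # wandb_run_name from the config
    config={"batch_size": 8, "learning_rate": 1e-4, "max_steps": 5000},
)

for step in range(5):
    wandb.log({"train/loss": 2.5 - 0.1 * step})   # dummy values for illustration

run.finish()
```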
Section 6: Evaluation Configuration
eval_config:
  repetition_penalty: 1.0 # Penalize repeated tokens
  temperature: 0.0 # Greedy (deterministic) sampling
  top_k: 50 # Consider top-K tokens
  top_p: 1.0 # Nucleus sampling threshold
Sampling Parameters Explained
| Parameter | Effect | Validation | Inference |
|---|---|---|---|
| `temperature` | Randomness (0=greedy, 1+=creative) | 0.0 (deterministic) | 0.7-1.0 (varied) |
| `top_k` | Only consider top K tokens | 50 | 40-80 |
| `top_p` | Nucleus sampling (cumulative probability) | 1.0 (disabled) | 0.9-0.95 |
| `repetition_penalty` | Discourage repeating tokens | 1.0 (disabled) | 1.1-1.3 |
For validation:
- Use `temperature=0.0` (greedy) for consistent evaluation
- Same prompt always generates same output
- Easy to spot improvements
For inference:
- Use `temperature=0.7-1.0` for variety
- Adjust based on use case (creative vs factual)
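To make the knobs concrete, here's an illustrative PyTorch sketch (not tt-train's actual sampler) of how temperature, top_k, and top_p interact when picking the next token; repetition_penalty is omitted for brevity:

```python
import torch

def sample_next_token(logits, temperature=0.0, top_k=50, top_p=1.0):
    """Toy next-token sampler using the eval_config knobs above."""
    if temperature == 0.0:
        return int(torch.argmax(logits))             # greedy: deterministic

    logits = logits / temperature                    # soften or sharpen the distribution
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if top_p < 1.0:                                  # nucleus sampling
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        remove = cumulative > top_p
        remove[1:] = remove[:-1].clone()             # keep the token that crosses the threshold
        remove[0] = False
        probs[sorted_idx[remove]] = 0.0
        probs = probs / probs.sum()
    return int(torch.multinomial(probs, 1))

logits = torch.randn(32000)                          # fake vocabulary logits
print(sample_next_token(logits))                               # validation: greedy
print(sample_next_token(logits, temperature=0.8, top_p=0.9))   # inference: varied
```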
Hardware-Specific Configurations
N150: Memory-Constrained
training_config:
  batch_size: 8 # Conservative
  gradient_accumulation_steps: 4 # Simulate batch_size=32
device_config:
  enable_ddp: False
  mesh_shape: [1, 1]
Key trade-offs:
- Slower training (smaller batch)
- Lower memory usage
- Single-device simplicity
N300: Balanced Performance
training_config:
  batch_size: 16 # Larger batch with DDP
  gradient_accumulation_steps: 2 # Still effective_batch=32
device_config:
  enable_ddp: True
  mesh_shape: [1, 2]
Key improvements:
- ~2x faster training
- Better hardware utilization
- Minimal code changes
T3K: High Performance
training_config:
  batch_size: 32 # Large batch
  gradient_accumulation_steps: 1 # No accumulation needed
device_config:
  enable_ddp: True
  mesh_shape: [2, 4] # 8 devices
Key advantages:
- ~8x faster training
- Experiment rapidly
- Train larger models
Common Configuration Mistakes
❌ Don't: Set Learning Rate Too High
learning_rate: 0.001 # 10x too high for fine-tuning!
Result: Loss explodes, model forgets everything, NaN errors.
Fix: Use 0.0001 (1e-4) for fine-tuning.
❌ Don't: Disable Gradient Clipping
use_clip_grad_norm: false # Risky!
Result: Occasional training crashes from exploding gradients.
Fix: Keep it enabled unless you have good reason not to.
❌ Don't: Save Too Frequently
model_save_interval: 1 # Save every step!
Result: Hundreds of checkpoints, disk space exhausted, slow I/O.
Fix: Save every 50-100 steps for small jobs, 500-1000 for large.
❌ Don't: Ignore Validation
validation_frequency: 99999 # Never validate
Result: Model overfits, you don't notice until the end.
Fix: Validate every 50-100 steps, check sample outputs.
❌ Don't: Mix Single/Multi-Device Settings
device_config:
  enable_ddp: True # DDP enabled...
  mesh_shape: [1, 1] # ...but only 1 device?
Result: Confusing errors or unexpected behavior.
Fix: enable_ddp: False for [1,1], enable_ddp: True for [1,2] or larger.
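A small pre-flight check catches this class of mistake before you waste a run. A hypothetical sketch (not part of tt-train):

```python
# Hypothetical sanity check for the device_config section.
def check_device_config(device_cfg: dict) -> None:
    rows, cols = device_cfg["mesh_shape"]
    num_devices = rows * cols
    if device_cfg["enable_ddp"] and num_devices == 1:
        raise ValueError("enable_ddp is True but mesh_shape is [1, 1] (single device)")
    if not device_cfg["enable_ddp"] and num_devices > 1:
        raise ValueError(f"mesh_shape gives {num_devices} devices but enable_ddp is False")

check_device_config({"enable_ddp": False, "mesh_shape": [1, 1]})    # OK
# check_device_config({"enable_ddp": True, "mesh_shape": [1, 1]})   # would raise ValueError
```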
Configuration Experimentation Workflow
Experimentation is the heart of ML engineering. Here's how to systematically improve your models through config changes:
graph TD
A[Start: Baseline Config<br/>training_n150.yaml] --> B[Run Training<br/>Monitor loss & samples]
B --> C{Results Good?}
C -->|Yes| D[🎉 Use This Config<br/>Save as production]
C -->|No| E[Identify Issue<br/>Loss too high? Overfit? Slow?]
E --> F{What to Change?}
F -->|Loss not improving| G[Try Higher LR<br/>1e-4 → 2e-4]
F -->|Loss jumpy/unstable| H[Try Lower LR<br/>1e-4 → 5e-5]
F -->|Training too slow| I[Increase batch_size<br/>8 → 16]
F -->|Out of memory| J[Decrease batch_size<br/>or add gradient_accumulation]
G --> K[Run Experiment<br/>Change ONE parameter]
H --> K
I --> K
J --> K
K --> L[Compare Results<br/>Better or worse?]
L -->|Better| M[Keep Change<br/>Update baseline]
L -->|Worse| N[Revert Change<br/>Try something else]
M --> O{More to Try?}
N --> O
O -->|Yes| F
O -->|No| D
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style B fill:#7B68EE,stroke:#333,stroke-width:2px
style D fill:#50C878,stroke:#333,stroke-width:3px
style E fill:#E85D75,stroke:#333,stroke-width:2px
style K fill:#7B68EE,stroke:#333,stroke-width:2px
style L fill:#E85D75,stroke:#333,stroke-width:2px
Key principle: Change one thing at a time.
1. Start with Baseline Config
Use a baseline config appropriate for your hardware as-is. This is your reference point.
2. Change One Thing at a Time
Good approach:
Run 1: batch_size=8, lr=1e-4
Run 2: batch_size=16, lr=1e-4 # Changed batch size only ✅
Run 3: batch_size=16, lr=5e-5 # Changed LR only ✅
Bad approach:
Run 1: batch_size=8, lr=1e-4, steps=500
Run 2: batch_size=16, lr=5e-5, steps=1000 # Changed everything! ❌
Why? If Run 2 is better, you won't know if it was the batch size, learning rate, or step count that made the difference. Scientific method requires isolating variables.
3. Track Results
Create experiments.md:
## Experiment 1: Baseline
- Config: training_n150.yaml
- Final train loss: 2.34
- Final val loss: 2.56
- Sample output: "Good"
## Experiment 2: Higher Batch Size
- Config: batch_size=16
- Final train loss: 2.21
- Final val loss: 2.48
- Sample output: "Better!"
- **Conclusion:** Larger batch helps
4. Version Your Configs
configs/
training_n150_v1.yaml # Baseline
training_n150_v2.yaml # Higher batch
training_n150_v3.yaml # Lower LR
Why: Know which config produced which model.
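To verify that only one thing changed between two versions, a small diff script helps. A sketch assuming PyYAML and the hypothetical file names above:

```python
import yaml

def load_flat(path):
    """Load a YAML config and flatten nested keys into dotted names."""
    with open(path) as f:
        return _flatten(yaml.safe_load(f))

def _flatten(d, prefix=""):
    out = {}
    for key, value in d.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(_flatten(value, name + "."))
        else:
            out[name] = value
    return out

def diff_configs(path_a, path_b):
    a, b = load_flat(path_a), load_flat(path_b)
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f"{key}: {a.get(key)} -> {b.get(key)}")

diff_configs("configs/training_n150_v1.yaml", "configs/training_n150_v2.yaml")
# e.g. training_config.batch_size: 8 -> 16
```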
Configuration Templates
Quick Start (Just Train!)
training_config:
  batch_size: 8
  learning_rate: 0.0001
  max_steps: 5000
  model_config: "model_configs/model.yaml"
  checkpoint_dir: "checkpoints"
device_config:
  enable_ddp: False
  mesh_shape: [1, 1]
Use when: You want to get started quickly, no frills.
Research (Maximum Visibility)
training_config:
  batch_size: 8
  learning_rate: 0.0001
  max_steps: 5000
  validation_frequency: 25 # Validate often
  checkpoint_frequency: 50 # Save often
  use_wandb: true # Track everything
  log_level: "DEBUG" # Verbose logging
device_config:
  enable_ddp: False
  mesh_shape: [1, 1]
Use when: Debugging, research, need full visibility.
Production (Fast Iteration)
training_config:
  batch_size: 16
  learning_rate: 0.0001
  max_steps: 500
  validation_frequency: 100 # Less frequent
  checkpoint_frequency: 250 # Only keep key checkpoints
  use_wandb: false # Simple file logging
device_config:
  enable_ddp: True
  mesh_shape: [1, 2]
Use when: Iterating on production models, N300+ available.
Real-World Configuration Scenarios
Configuration isn't just about technical settings - it's about solving real problems within constraints. Let's explore how different scenarios drive different config choices.
Scenario 1: The Medical Chatbot (Privacy-First)
Challenge: Fine-tune a model for medical Q&A within HIPAA constraints.
Configuration decisions:
training_config:
  batch_size: 4 # Small batches (limited patient data)
  learning_rate: 5e-5 # Conservative to preserve medical knowledge
  checkpoint_frequency: 50 # Frequent saves (expensive hardware time)
  validation_frequency: 25 # Validate often (safety-critical)
  use_wandb: false # NO cloud logging (HIPAA compliance)
  log_level: "INFO" # Local-only logging
device_config:
  enable_ddp: False # On-premise N150 only
  mesh_shape: [1, 1]
Result: Production model in 2 hours on N150, deployable with vLLM (Lesson 7), fully compliant.
Total time: One afternoon of fine-tuning, months of value.
Scenario 2: The Code Translator (Speed Matters)
Challenge: PyTorch → TTNN translator for internal dev team. Need fast iteration.
Configuration decisions:
training_config:
  batch_size: 16 # Larger batch on N300
  learning_rate: 1e-4 # Standard fine-tuning LR
  max_steps: 300 # Shorter runs for rapid experiments
  checkpoint_frequency: 100 # Less frequent (iterate fast)
  validation_frequency: 50 # Regular quality checks
  use_wandb: true # Track 10+ experiments easily
  wandb_project: "pytorch-to-ttnn"
device_config:
  enable_ddp: True # N300 for 2x speedup
  mesh_shape: [1, 2]
Result: Iterate through 10 model versions in 2 days. Find winning config. Deploy.
Impact: 500 examples → model that saves team 5 hours/week.
Scenario 3: The Research Experiment (Maximum Insight)
Challenge: Testing novel attention patterns. Need full visibility into training dynamics.
Configuration decisions:
training_config:
  batch_size: 8 # Standard for N150
  learning_rate: 1e-4
  max_steps: 1000 # Longer run to see convergence
  checkpoint_frequency: 50 # Save often (expensive compute)
  validation_frequency: 25 # Validate very often
  use_wandb: true # Essential for analysis
  log_level: "DEBUG" # Maximum visibility
  gradient_accumulation_steps: 4 # Simulate larger batch
eval_config:
  temperature: 0.0 # Deterministic for fair comparison
device_config:
  enable_ddp: False # Single device for simplicity
  mesh_shape: [1, 1]
Result: Rich training logs, beautiful WandB visualizations, clear insights into what works.
Learning: Config isn't just for training - it's for understanding.
Scenario 4: The Production Pipeline (Reliability & Scale)
Challenge: Training custom models weekly for production deployment. Need consistency and speed.
Configuration decisions:
training_config:
  batch_size: 32 # T3K can handle it
  learning_rate: 1e-4
  max_steps: 500
  checkpoint_frequency: 250 # Only keep key checkpoints
  validation_frequency: 100 # Less frequent (known dataset quality)
  use_wandb: true # Track production runs
  use_clip_grad_norm: true # Safety net
  gradient_accumulation_steps: 1 # No accumulation needed
device_config:
  enable_ddp: True # T3K mesh
  mesh_shape: [2, 4] # 8 devices, 8x speedup
Result: Train multiple models per day. A/B test in production. Iterate based on user feedback.
Scale: From prototype (N150) → production (T3K) seamlessly. Same config pattern, different values.
What These Scenarios Teach Us
Configuration reflects your constraints:
- Privacy concerns → No cloud logging, local-only
- Speed requirements → Multi-device, shorter runs, WandB tracking
- Research goals → Maximum logging, frequent checkpoints, careful validation
- Production scale → Large batches, fast hardware, reliability features
The same tt-blacksmith pattern works for all scenarios. Only the values change.
Your Configuration Journey
Week 1 (N150, Learning):
- Use baseline configs from this extension
- Focus on understanding what each parameter does
- Experiment with one parameter at a time
- Goal: Build intuition
Week 2-3 (N150, Iterating):
- Apply lessons to your domain
- Create custom configs for your use case
- Track experiments systematically
- Goal: Find what works for your data
Month 2+ (N300/T3K, Scaling):
- Scale successful configs to faster hardware
- Run multiple experiments in parallel
- Build a library of proven configs
- Goal: Production-ready workflow
The power isn't in any single config value.
The power is in systematic experimentation, guided by configuration.
Key Takeaways
✅ Configuration-driven training is reproducible and shareable
✅ batch_size and learning_rate are your most important hyperparameters
✅ Gradient accumulation simulates larger batches
✅ Checkpoint frequently (but not too frequently)
✅ Validate more often than you save checkpoints
✅ Start with baseline config, change one thing at a time
✅ Use WandB for experiment tracking (optional but powerful)
Next Steps
Lesson CT-4: Fine-tuning Basics
You've prepared your dataset (CT-2) and configured your training (CT-3). Now it's time to actually train a model!
In CT-4, you'll:
- Install tt-train
- Launch your first fine-tuning job
- Monitor training progress
- Load and test your fine-tuned model
- See your model in action!
Estimated time: 20-25 minutes (+ 1-3 hours training time) | Prerequisites: CT-2, CT-3
Additional Resources
Configuration Examples
- tt-train examples: Check `tt-metal/tt-train/sources/examples/` for sample configs
- tt-blacksmith: Reference patterns for config organization
- Your experiments: Build your own library of proven configs
Deep Dives
- Adam optimizer paper - Understanding adaptive LR
- Mixed precision training - BF16/FP32 techniques
- Learning rate schedules - Advanced scheduling
Tools
- WandB: wandb.ai - Experiment tracking
- TensorBoard: Alternative to WandB (local-only)
Ready to run your first training job? Continue to Lesson CT-4: Fine-tuning Basics →