Hardware: N150, N300, T3K, P100, P150, P300C, Galaxy | Time: 15 min | Status: Blocked

Experiment Tracking

Learn to track, compare, and visualize training experiments using file-based logging and optional WandB integration.

What You'll Learn

Time: 10-15 minutes | Prerequisites: CT-4 (Fine-tuning Basics)


Why Track Experiments?

The Problem

You run 10 training experiments with different hyperparameters:

Which batch size worked best?
Which learning rate converged fastest?
Did that checkpoint from Tuesday outperform today's?

Without tracking: Scroll through terminal logs, compare files manually, rely on memory.

With tracking: Compare all runs at a glance, see visualizations, make data-driven decisions.


Approach 1: File-Based Tracking (Baseline)

What's Already Tracked

The train.py script automatically logs:

1. Training log:

output/training.log

Contains:

2. Validation samples:

output/validation.txt

Contains:

3. Training curves:

output/training_curves.png

Visualizes:

Organizing Experiments

Bad (hard to track):

output/
  final_model/
  training.log

Good (organized by date/name):

experiments/
  2026-02-01_baseline/
    config.yaml
    training.log
    validation.txt
    training_curves.png
    final_model/

  2026-02-01_higher_lr/
    config.yaml
    training.log
    validation.txt
    training_curves.png
    final_model/
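
One way to keep this layout consistent is to snapshot each run's artifacts right after training finishes. The helper below is a sketch, not part of the lesson's code: snapshot_run and its defaults are hypothetical, and it assumes the output/ layout produced by train.py as listed above.

#!/usr/bin/env python3
"""Snapshot the latest training output into a dated experiment folder.

A sketch assuming the default output/ layout shown above; adjust the
file names if your train.py writes something different.
"""
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def snapshot_run(name: str, output_dir: str = "output",
                 config: Optional[str] = None) -> Path:
    """Copy training artifacts into experiments/<date>_<name>/."""
    dest = Path("experiments") / f"{date.today().isoformat()}_{name}"
    dest.mkdir(parents=True, exist_ok=True)

    # Copy the standard artifacts produced by train.py, if present
    for artifact in ("training.log", "validation.txt", "training_curves.png"):
        src = Path(output_dir) / artifact
        if src.exists():
            shutil.copy2(src, dest / artifact)

    # The final model directory can be large, so copy it last
    model_dir = Path(output_dir) / "final_model"
    if model_dir.exists():
        shutil.copytree(model_dir, dest / "final_model", dirs_exist_ok=True)

    # Keep the config next to the results for reproducibility
    if config:
        shutil.copy2(config, dest / "config.yaml")
    return dest


if __name__ == "__main__":
    print(snapshot_run("baseline", config="configs/training_n150.yaml"))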

Manual Comparison Script

Create compare_experiments.sh:

#!/bin/bash

echo "Experiment Comparison"
echo "===================="

for exp in experiments/*/; do
  echo ""
  echo "Experiment: $(basename $exp)"

  # Extract final loss
  final_loss=$(tail -5 "$exp/training.log" | grep "Final training loss" | awk '{print $NF}')
  echo "  Final Loss: $final_loss"

  # Extract config values
  lr=$(grep "learning_rate:" "$exp/config.yaml" | awk '{print $2}')
  batch=$(grep "batch_size:" "$exp/config.yaml" | awk '{print $2}')
  echo "  LR: $lr, Batch: $batch"
done

Run:

chmod +x compare_experiments.sh
./compare_experiments.sh

Output:

Experiment Comparison
====================

Experiment: 2026-02-01_baseline
  Final Loss: 1.84
  LR: 0.0001, Batch: 8

Experiment: 2026-02-01_higher_lr
  Final Loss: 1.92
  LR: 0.0002, Batch: 8

Approach 2: WandB Integration (Optional)

What is Weights & Biases?

WandB (Weights & Biases) is a cloud-based experiment tracking platform.

Website: wandb.ai

Setup (One-Time)

1. Create account:

# Visit wandb.ai and sign up (free)

2. Install WandB:

pip install wandb

3. Login:

wandb login

Paste your API key when prompted.

4. Enable in config:

# configs/training_n150_wandb.yaml
training_config:
  # ... other settings ...

  use_wandb: true                  # Enable WandB
  wandb_project: "my-training-project"
  wandb_run_name: "n150-baseline"
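
Under the hood, those keys just need to reach wandb.init(). If you ever wire WandB into your own training script, a minimal sketch (assuming the YAML layout above; train.py's actual handling may differ) looks like this:

import yaml
import wandb

# Rough sketch of how a training script can honor the YAML keys above;
# the exact wiring inside train.py may differ.
with open("configs/training_n150_wandb.yaml") as f:
    cfg = yaml.safe_load(f)["training_config"]

if cfg.get("use_wandb", False):
    wandb.init(
        project=cfg.get("wandb_project", "my-training-project"),
        name=cfg.get("wandb_run_name"),
        config=cfg,  # records every hyperparameter alongside the run
    )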

Running with WandB

cd ~/tt-scratchpad/training
python train.py --config configs/training_n150_wandb.yaml

What gets logged:

WandB Dashboard

After training starts, you'll see:

wandb: 🚀 View run at https://wandb.ai/your-username/my-training-project/runs/abc123

Dashboard shows:

  1. Overview tab:

    • Run summary (final loss, duration)
    • Hyperparameters
    • System info
  2. Charts tab:

    • Real-time loss curves
    • Custom plots
    • Compare with other runs
  3. Logs tab:

    • Generated text samples
    • Validation outputs
  4. Files tab:

    • Config files
    • Saved artifacts

Comparing Experiments

Scenario: Finding Best Learning Rate

You want to try 3 learning rates: 5e-5, 1e-4, 2e-4

1. Run three experiments:

# Experiment 1: LR = 5e-5
python train.py \
  --config configs/training_n150_lr_5e5.yaml

# Experiment 2: LR = 1e-4
python train.py \
  --config configs/training_n150_lr_1e4.yaml

# Experiment 3: LR = 2e-4
python train.py \
  --config configs/training_n150_lr_2e4.yaml

2. Compare in WandB:

Go to your project page, click "Compare runs":

3. Identify best:

Run 1 (5e-5):  Final val loss: 2.34  (too slow)
Run 2 (1e-4):  Final val loss: 2.12  (best!)
Run 3 (2e-4):  Final val loss: 2.28  (too aggressive)

Conclusion: LR = 1e-4 is the best of the three.
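
The same comparison can be pulled programmatically with WandB's public API. A sketch, assuming each run logs a val_loss metric and a learning_rate config value (match the key names to whatever your script actually logs):

import wandb

# List every run in the project with the hyperparameter and metric we care about
api = wandb.Api()
for run in api.runs("your-username/my-training-project"):
    lr = run.config.get("learning_rate")
    val_loss = run.summary.get("val_loss")
    print(f"{run.name:<30} lr={lr}  val_loss={val_loss}")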


Advanced WandB Features

1. Logging Custom Metrics

Add to your training script:

import wandb

# After each optimizer step (avg_loss, current_lr, grad_norm, and opt_step
# come from your training loop)
wandb.log({
    "train_loss": avg_loss,
    "learning_rate": current_lr,
    "gradient_norm": grad_norm,
}, step=opt_step)

# After validation
wandb.log({
    "val_loss": val_loss,
    "sample_output": generated_text,
}, step=opt_step)

2. Hyperparameter Sweeps

Automate hyperparameter search:

# sweep.yaml
program: train.py
method: grid
parameters:
  learning_rate:
    values: [5e-5, 1e-4, 2e-4]
  batch_size:
    values: [8, 16]

Run sweep:

wandb sweep sweep.yaml
wandb agent your-username/my-training-project/sweep-id

WandB automatically runs all combinations!
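
One detail to be aware of: by default the sweep agent launches the program with each swept parameter as a command-line flag (for example --learning_rate=0.0001), so train.py needs to accept and apply those flags. A minimal sketch of that argument handling, assuming the script otherwise reads its settings from the YAML config:

import argparse

# By default the sweep agent appends each swept parameter as a flag,
# e.g. --learning_rate=0.0001 --batch_size=16, so the training script
# must accept them and override its config.
parser = argparse.ArgumentParser()
parser.add_argument("--config", default="configs/training_n150.yaml")
parser.add_argument("--learning_rate", type=float, default=None)
parser.add_argument("--batch_size", type=int, default=None)
args = parser.parse_args()

# Later, after loading the YAML config (sketch):
# if args.learning_rate is not None:
#     cfg["learning_rate"] = args.learning_rate
# if args.batch_size is not None:
#     cfg["batch_size"] = args.batch_size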

3. Model Artifacts

Save checkpoints to WandB:

import wandb

# After saving checkpoint
artifact = wandb.Artifact('trained-model', type='model')
artifact.add_dir('output/final_model')
wandb.log_artifact(artifact)
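
Once logged, the same artifact can be pulled back onto any machine via the public API. A sketch, reusing the project path and artifact name from the example above:

import wandb

# Pull a previously logged model artifact onto any machine
api = wandb.Api()
artifact = api.artifact("your-username/my-training-project/trained-model:latest")
local_dir = artifact.download()  # returns the directory the files were extracted to
print(f"Model restored to {local_dir}")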

Benefits:

4. Group Experiments

Organize related runs:

wandb.init(
    project="my-training-project",
    group="lr-search",              # Group related experiments
    tags=["n150", "baseline"],      # Add tags for filtering
)

Best Practices for Experiment Management

1. Naming Convention

Use descriptive names:

Good:  "2026-02-01_n150_lr1e4_batch8_baseline"
Bad:   "experiment_1"

Include key info in the name: the date, hardware, learning rate, batch size, and a short description of what changed.
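
A tiny helper can assemble such names automatically; a sketch with illustrative parameter names:

from datetime import date

def make_run_name(hardware: str, lr: float, batch_size: int, tag: str) -> str:
    """Build a descriptive run name, e.g. 2026-02-01_n150_lr1e-04_batch8_baseline."""
    return f"{date.today().isoformat()}_{hardware}_lr{lr:.0e}_batch{batch_size}_{tag}"

print(make_run_name("n150", 1e-4, 8, "baseline"))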

2. Version Control Configs

# Save configs alongside code
git add configs/training_n150_lr1e4.yaml
git commit -m "Add config for LR=1e-4 experiment"
git tag exp-lr1e4

Why: Reproducibility - know exactly what config produced results.

3. Document Results

Create experiments.md:

# Custom Training Experiments

## Experiment 1: Baseline (2026-02-01)
- **Config:** training_n150.yaml
- **Hardware:** N150
- **Duration:** 2.3 hours
- **Final Loss:** 1.84 (train), 2.12 (val)
- **Result:** Good baseline, will try higher LR next
- **WandB:** [link](https://wandb.ai/...)

## Experiment 2: Higher LR (2026-02-01)
- **Config:** training_n150_lr2e4.yaml
- **Hardware:** N150
- **Duration:** 2.1 hours
- **Final Loss:** 1.92 (train), 2.28 (val)
- **Result:** Slightly worse, LR=1e-4 is better
- **WandB:** [link](https://wandb.ai/...)

4. Archive Failed Experiments

Don't delete failures - they teach you what doesn't work!

experiments/
  successful/
    2026-02-01_baseline/
  failed/
    2026-01-30_lr_too_high/       # Exploded at step 50
    2026-01-31_batch_too_large/   # OOM error

5. Regular Cleanup

Keep only the most recent checkpoints (here, the last two) and archive the rest:

# Keep only step 400, 500 checkpoints
rm -rf output/checkpoint_step_100
rm -rf output/checkpoint_step_200
rm -rf output/checkpoint_step_300

# Or archive to S3/NAS
tar -czf checkpoints_baseline.tar.gz output/
mv checkpoints_baseline.tar.gz /archive/
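
If you checkpoint frequently, pruning can be scripted rather than done by hand. A sketch, assuming checkpoints are written as output/checkpoint_step_<N> directories as in the commands above:

import re
import shutil
from pathlib import Path

def prune_checkpoints(output_dir: str = "output", keep: int = 2) -> None:
    """Delete all but the `keep` newest checkpoint_step_<N> directories."""
    pattern = re.compile(r"checkpoint_step_(\d+)")
    checkpoints = []
    for path in Path(output_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if path.is_dir() and match:
            checkpoints.append((int(match.group(1)), path))

    checkpoints.sort()                    # oldest first
    for _, path in checkpoints[:-keep]:   # everything except the newest `keep`
        shutil.rmtree(path)

prune_checkpoints("output", keep=2)  # keeps step 400 and 500 in the example above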

Visualization Tips

Loss Curve Analysis

Healthy training:

Loss
  4 |*
    | *
  3 |  **
    |    ***
  2 |       *****
    |            -------
  1 |___________________
    0   100   200   300   400   500
                Steps

Overfitting:

Loss
  4 |*
    | *
  3 |  **
    |    ***       Val .....-----↗   (starts rising again)
  2 |       ****
    |           ******   Train (keeps falling)
  1 |___________________
    0   100   200   300   400   500
                Steps

Underfitting:

Loss
  4 |*  **  **  **  **   (loss stays high, barely improves)
    |
  3 |
    |
  2 |
    |
  1 |___________________
    0   100   200   300   400   500
                Steps

Experiment Workflow Template

Phase 1: Baseline (1 run)

Goal: Get something working
- Use default config
- Verify training completes
- Check sample outputs

Phase 2: Hyperparameter Search (3-5 runs)

Goal: Find optimal settings
- Try 3 learning rates
- Try 2 batch sizes
- Keep other settings constant

Phase 3: Refinement (2-3 runs)

Goal: Polish best config
- Take best from Phase 2
- Try minor variations
- Longer training

Phase 4: Validation (1 run)

Goal: Final confirmation
- Retrain with best config
- Full evaluation
- Document results

Total: 7-10 experiments to find optimal settings.


Beyond This Lesson: From Ad-Hoc to Professional ML Engineering

You've learned the tools for tracking experiments. But what transforms scattered training runs into systematic ML engineering? Let's explore how experiment tracking becomes the foundation for data-driven model development.

Real-World ML Engineering Stories

What professional teams have built with systematic tracking:

🚀 "API Documentation Bot" (Solo dev → Team product)

💼 "Medical Report Classifier" (Research → Clinical deployment)

🎮 "Game NPC Dialogue" (Indie studio, 2-person team)

🏥 "Radiology Assistant" (Startup → FDA submission)

The Data-Driven ML Development Cycle

How tracking transforms your workflow:

graph TD
    A[Traditional Approach: No Tracking] --> B[Try config]
    B --> C[Wait for training]
    C --> D[Check result]
    D --> E[Forget what you tried]
    E --> F[Try something random]
    F --> B

    G[Data-Driven Approach: With Tracking] --> H[Form hypothesis]
    H --> I[Design experiments]
    I --> J[Run multiple configs]
    J --> K[Compare results in WandB]
    K --> L[Identify patterns]
    L --> M[Make informed decision]
    M --> N[Document findings]
    N --> H

    style A fill:#E85D75,stroke:#333,stroke-width:2px
    style G fill:#50C878,stroke:#333,stroke-width:2px
    style E fill:#FF6B6B,stroke:#333,stroke-width:2px
    style L fill:#7B68EE,stroke:#333,stroke-width:3px

The difference: without tracking you loop blindly through one-off configs and forget what you tried; with tracking every run feeds a hypothesis, a comparison, and a documented decision.

What Systematic Tracking Reveals

Insights you miss without tracking:

🔍 "The batch size doesn't matter... until it does"

📊 "Our validation set was too easy"

⏱️ "We were overtraining by 300%"

💡 "Dataset quality beats hyperparameter tuning"

From Experiments to Insights: The Tracking Hierarchy

Level 1: Survival (File logs)

Level 2: Efficiency (Organized files)

Level 3: Professional (WandB basic)

Level 4: Team-Scale (WandB advanced)

Level 5: Production (WandB + CI/CD)

The Compound Interest of Good Tracking

Month 1:

Month 3:

Month 6:

Year 1:

Your Experiment Tracking Journey

This week (File-based tracking):

Next week (WandB setup):

Month 2 (Professional workflow):

Production (Systematic ML engineering):

The Questions Tracking Answers

Without tracking, you wonder:

With tracking, you know:

The transformation:

Imagine: Your ML Engineering Future

You now understand:

What will you build with systematic tracking?

The question isn't "Should I track experiments?"

The question is "How many breakthroughs am I missing without tracking?"


Key Takeaways

File-based tracking works for simple cases

WandB scales to many experiments effortlessly

Compare runs side-by-side to make informed decisions

Use consistent naming and documentation

Don't delete failed experiments - learn from them

Version control your configs


Next Steps

You've completed the core Custom Training lessons (CT-1 through CT-6)!

Optional Advanced Lessons:

Lesson CT-7: Model Architecture Basics

Understand transformer components before training from scratch:

Lesson CT-8: Training from Scratch

Build a tiny model (10-20M parameters) from the ground up:

Or apply your knowledge:

  1. Create your own dataset for a specific task
  2. Fine-tune for your use case
  3. Deploy with vLLM (Lesson 7)
  4. Share your results with the community!

Additional Resources

WandB

Experiment Management

Visualization


Congratulations on completing the core Custom Training series! 🎉

You now have the tools to fine-tune models, scale to multiple devices, and track experiments professionally.

Continue to Lesson CT-7: Model Architecture Basics for deep understanding, or start building your own custom models!