
Bounty Program: Model Bring-Up from Scratch

Learn how to tackle open bounties in the Tenstorrent Bounty Program by bringing up a new model on TT hardware. We'll use the successful Phi-3-mini-128k-instruct bounty (Issue #19416) as a real-world case study.


What is the Bounty Program?

Tenstorrent's bounty program rewards contributors with $500–$3,000 for bringing new AI models up on its hardware. Submissions are evaluated on:

  1. Model functionality - Compiles and runs end-to-end inference
  2. Performance - Meets throughput benchmarks (25%/50%/70% of theoretical max)
  3. Accuracy - Validated against CPU baseline (top-1 >80%, top-5 >95%)
  4. Documentation - Clear build/run instructions for the community

Why Participate?


Bounty Difficulty Tiers

| Difficulty | Complexity | Scope |
|------------|------------|-------|
| Warmup | First-time contributor tasks | Getting familiar with the codebase |
| Easy | Basic repo familiarity | Straightforward implementations |
| Medium | Significant domain knowledge | Complex integrations |
| Hard | Deep architectural expertise | Novel architectures or optimizations |

Performance-based tiers (for model bring-up):


Case Study: Phi-3-mini-128k-instruct (Issue #19416)

Timeline

May 1:   Contributor (ign-msati) expresses interest
May 2:   Officially assigned to issue
May 12:  Individual blocks functional, full network integration underway
May 26:  Testing with varied prefill lengths
May 29:  Pull request #22716 submitted
Status:  MERGED ✅ Contribution accepted

Key Success Factors

  1. Reused existing framework - Leveraged tt_transformers instead of duplicating code
  2. Minimal modifications - Extended RoPE scaling, adjusted chunk sizes
  3. Component-wise bring-up - Tested individual modules before full model
  4. Thorough testing - Unit tests, performance benchmarks, accuracy validation
  5. Clear communication - Regular updates to issue thread

Step-by-Step: Bringing Up a Model

Phase 1: Setup & Preparation

1.1 Find a Bounty

Browse open bounties:

# Visit GitHub issues page
open https://github.com/tenstorrent/tt-metal/labels/bounty

Filter by difficulty:

Get assigned:

1.2 Set Up Environment

Clone and build tt-metal:

cd ~
git clone https://github.com/tenstorrent/tt-metal.git
cd tt-metal
git submodule update --init --recursive

# Build TT-Metal (takes 10-20 minutes)
./build_metal.sh

# Set environment variables
export TT_METAL_HOME=~/tt-metal
export PYTHONPATH=$TT_METAL_HOME:$PYTHONPATH

Install dependencies:

# Python requirements
pip install -r requirements.txt
pip install -r models/tt_transformers/requirements.txt

# Additional tools
pip install pytest huggingface-hub

Verify hardware:

tt-smi  # Should detect your TT device

1.3 Run a Reference Demo

Test the environment with a proven model:

# Download a working model (e.g., Llama 3.1 8B)
export HF_MODEL=meta-llama/Llama-3.1-8B-Instruct
export MESH_DEVICE=N150  # or N300, T3K, etc.

# Run demo to verify setup
pytest models/tt_transformers/demo/simple_text_demo.py -k "performance and batch-1"

What to expect:


Phase 2: Baseline Validation

2.1 Run Reference Model on CPU

Critical first step: Ensure the model works correctly on CPU/GPU before attempting TT hardware.

# Create a working directory for the reference validation script
mkdir -p ~/tt-scratchpad
cd ~/tt-scratchpad

Example reference script (save as validate_reference.py):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "microsoft/Phi-3-mini-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Run inference
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"Prompt: {prompt}")
print(f"Response: {response}")

# Save generated token IDs for later comparison
# (pass return_dict_in_generate=True, output_scores=True to also keep logits)
torch.save(outputs, "reference_outputs.pt")

Run validation:

python validate_reference.py
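
Keep the reference outputs around: every later phase compares TT tensors against a reference using PCC (Pearson correlation coefficient). Below is a minimal helper, assuming both tensors live on CPU; it is for illustration only, since tt-metal ships its own comparison utilities in its test harness.

# compute_pcc.py - minimal PCC helper (illustrative sketch)
import torch

def compute_pcc(a: torch.Tensor, b: torch.Tensor) -> float:
    a = a.flatten().float()
    b = b.flatten().float()
    # PCC is undefined for zero-variance tensors; fall back to exact equality
    if a.std() == 0 or b.std() == 0:
        return float(torch.equal(a, b))
    return torch.corrcoef(torch.stack([a, b]))[0, 1].item()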

2.2 Analyze Model Architecture

Inspect model configuration:

hf download microsoft/Phi-3-mini-128k-instruct config.json --local-dir ~/models/Phi-3
cat ~/models/Phi-3/config.json

Key questions to answer (the script after this list automates the checks):

  1. Architecture type: LlamaForCausalLM, Qwen2ForCausalLM, MistralForCausalLM, Phi3ForCausalLM?
    • ✅ If listed in tt_transformers/README.md, likely compatible!
  2. Model dimensions: hidden_size, num_attention_heads, num_layers
    • Check if tile-aligned (divisible by 32 for TT hardware)
  3. Special features: RoPE scaling, sliding window attention, custom tokens?
  4. Hardware fit: Will it fit on target device? (N150 = 12GB, N300 = 24GB)
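
A short script answers most of these directly from config.json. This is a sketch: the memory estimate assumes bfloat16 weights with standard multi-head attention and ignores KV cache and activations, so treat the fit check as a floor.

import json

# Sanity-check a HF config for TT bring-up (sketch; see caveats above)
with open("config.json") as f:  # e.g. ~/models/Phi-3/config.json
    cfg = json.load(f)

print("architecture:", cfg["architectures"][0])

hidden = cfg["hidden_size"]
layers = cfg["num_hidden_layers"]
head_dim = hidden // cfg["num_attention_heads"]
print(f"hidden_size={hidden} tile-aligned: {hidden % 32 == 0}")
print(f"head_dim={head_dim} tile-aligned: {head_dim % 32 == 0}")

# Rough parameter count: embeddings + per-layer attention (q,k,v,o) + gated MLP
inter = cfg["intermediate_size"]
vocab = cfg["vocab_size"]
params = vocab * hidden + layers * (4 * hidden * hidden + 3 * hidden * inter)
print(f"~{params / 1e9:.1f}B params, "
      f"~{2 * params / 1e9:.1f} GB in bfloat16 (N150 = 12 GB)")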

For Phi-3:


Phase 3: Component-Wise Bring-Up

Philosophy: Test small pieces before the full model. This is the MOST IMPORTANT phase.

3.1 Identify Similar Models

Find the closest match in tt_transformers:

ls models/tt_transformers/model_params/
# Phi-3 architecture is similar to Llama
# Use Llama as the base implementation
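
If the architecture name isn't an exact match, diffing the new config against a supported model's config shows exactly which knobs differ. A sketch, assuming both config.json files were downloaded locally:

import json

def load_config(path):
    with open(path) as f:
        return json.load(f)

base = load_config("Llama-3.1-8B/config.json")  # supported model
new = load_config("Phi-3/config.json")          # bounty model

# Print every key whose value differs - each line is potential porting work
for key in sorted(set(base) | set(new)):
    if base.get(key) != new.get(key):
        print(f"{key}: {base.get(key)!r} -> {new.get(key)!r}")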

3.2 Bring Up Decode Stage First

Why decode first? Decode is simpler (batch=32, a single token per user per step) and memory-bandwidth-bound rather than compute-bound, so it is the easier stage to get correct first.

Create unit tests for individual modules:

# Example: Test RMSNorm module
# Save as tests/test_phi3_rmsnorm.py
# A sketch: the RMSNorm constructor signature and direct torch-tensor call are
# simplified; the real tests convert inputs to ttnn tensors via device fixtures.

import pytest
import torch
from models.tt_transformers.tt.llama_transformer import RMSNorm  # Reuse from Llama

def test_rmsnorm(device):  # 'device' is provided by the tt-metal pytest fixture
    # Model dimensions for Phi-3
    hidden_dim = 3072

    # Create TT-NN version
    tt_norm = RMSNorm(device, dim=hidden_dim)

    # Create reference PyTorch version (torch.nn.RMSNorm needs torch >= 2.4)
    ref_norm = torch.nn.RMSNorm(hidden_dim)

    # Generate random input
    x = torch.randn(1, 1, hidden_dim)

    # Compare outputs
    tt_out = tt_norm(x)
    ref_out = ref_norm(x)

    # Check PCC (Pearson Correlation Coefficient; helper from Phase 2)
    pcc = compute_pcc(tt_out, ref_out)
    assert pcc > 0.99, f"RMSNorm PCC too low: {pcc}"

Test each module:

Run unit tests:

pytest tests/test_phi3_rmsnorm.py -v
pytest tests/test_phi3_attention.py -v
pytest tests/test_phi3_mlp.py -v

3.3 Compose Full Decoder

Once all modules pass, test the full decoder:

# tests/test_phi3_decoder.py
# A sketch: TransformerBlock, mesh_device, model_args, and reference_decoder
# come from the tt_transformers test harness, and "..." stands for arguments
# (rotary matrices, KV cache, etc.) elided for brevity.

import torch

def test_full_decoder():
    # Create decoder with all modules
    decoder = TransformerBlock(
        mesh_device,
        model_args,
        layer_num=0
    )

    # Random activations; the real weights are loaded inside the block
    x = torch.randn(32, 1, 3072)  # batch=32, seq=1, hidden=3072

    # Run through TT decoder
    tt_output = decoder(x, ...)

    # Run through reference decoder
    ref_output = reference_decoder(x, ...)

    # Check PCC (helper from Phase 2)
    pcc = compute_pcc(tt_output, ref_output)
    assert pcc > 0.98

3.4 Handle Model-Specific Modifications

For Phi-3, the main modification was RoPE scaling:

File: models/tt_transformers/tt/rope.py

# Original: Single scaling factor
scale = 1.0 / rope_scaling_factor

# Phi-3 modification: support a per-frequency scaling tensor
if isinstance(rope_scaling, dict) and "long_factor" in rope_scaling:
    # LongRoPE (Phi-3's "su"/"longrope" scheme) scales each frequency differently
    scale_tensor = compute_longrope_scale(rope_scaling)
else:
    scale = 1.0 / rope_scaling_factor
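
For context, here is a minimal sketch of what a compute_longrope_scale-style helper does, mirroring the Hugging Face Phi-3 reference implementation. Function and variable names here are ours, not the tt_transformers API.

import math
import torch

def longrope_inv_freq(cfg, seq_len):
    # Per-frequency RoPE scaling for Phi-3 LongRoPE (sketch of the HF reference).
    # long_factor / short_factor each hold head_dim/2 per-frequency scale values.
    rs = cfg["rope_scaling"]
    orig_max = cfg["original_max_position_embeddings"]  # 4096 for Phi-3
    factors = rs["long_factor"] if seq_len > orig_max else rs["short_factor"]
    ext = torch.tensor(factors, dtype=torch.float32)

    head_dim = cfg["hidden_size"] // cfg["num_attention_heads"]
    base = cfg.get("rope_theta", 10000.0)
    inv_freq = 1.0 / (ext * base ** (torch.arange(0, head_dim, 2).float() / head_dim))

    # cos/sin tables are also rescaled so long-context attention stays calibrated
    scale = cfg["max_position_embeddings"] / orig_max
    attn_scale = math.sqrt(1 + math.log(scale) / math.log(orig_max)) if scale > 1 else 1.0
    return inv_freq, attn_scale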

File: models/tt_transformers/tt/model_config.py

# Add Phi-3 detection
if "Phi-3" in self.model_name:
    # Set prefill chunk size for long context
    self.min_prefill_chunk_size = 1024  # Lower for N150

File: models/tt_transformers/tt/common.py

# Batch padding normalization
# Old: pad each prompt independently to the nearest power of 2
# New: pad all prompts to the max length across the batch
def next_power_of_2(n):
    return 1 << (n - 1).bit_length()

max_len = max(len(p) for p in prompts)
padded_len = next_power_of_2(max_len)

Key insight: These are MINIMAL changes. Most of the implementation is reused from Llama!


Phase 4: Full Model Integration

4.1 Implement Prefill

After decode works, add prefill (process initial prompt):

Prefill is more complex:

# tests/test_phi3_prefill.py
# A sketch: model and reference_model come from fixtures, and the prompt
# placeholder stands for roughly 2048 tokens of real text.

def test_prefill():
    # Long prompt (~2048 tokens)
    prompt = "..." * 2048  # placeholder text

    # Run prefill on TT hardware
    logits = model.prefill(prompt)

    # Compare with reference
    ref_logits = reference_model.prefill(prompt)

    pcc = compute_pcc(logits, ref_logits)
    assert pcc > 0.98
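
Long-context models like Phi-3 can't always prefill 128K tokens in one shot; the prompt is processed in chunks, which is what the min_prefill_chunk_size setting from Phase 3 controls. A sketch of the idea, with a hypothetical prefill_chunk method (not the real tt_transformers API):

def chunked_prefill(model, token_ids, chunk_size=1024):
    # Process a long prompt in fixed-size slices while the KV cache fills.
    logits = None
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Each chunk attends to all previously cached positions via start_pos
        logits = model.prefill_chunk(chunk, start_pos=start)
    # Logits at the last position seed the first decode step
    return logits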

4.2 End-to-End Testing

Test full generation (prefill + decode):

# Run full demo
export HF_MODEL=microsoft/Phi-3-mini-128k-instruct
pytest models/tt_transformers/demo/simple_text_demo.py -k "batch-1"

What to check:

  1. Does it generate coherent text?
  2. Token accuracy: Compare generated tokens to reference
  3. Top-1/Top-5 accuracy: Measure against CPU baseline

4.3 Teacher Forcing Validation

Gold standard for accuracy testing:

# Teacher forcing: force the model to consume the reference token at each step.
# This isolates per-token accuracy without error accumulation.
# A sketch: generate_token, get_top_k, and forward are illustrative names.

def test_teacher_forcing():
    reference_tokens = [1, 234, 567, ...]  # From reference model

    top1_matches = 0
    top5_matches = 0

    for ref_token in reference_tokens:
        # Generate next token
        predicted_token = model.generate_token()

        # Check if it matches reference
        if predicted_token == ref_token:
            top1_matches += 1

        if ref_token in model.get_top_k(5):
            top5_matches += 1

        # Force-feed the reference token (teacher forcing)
        model.forward(ref_token)

    top1_accuracy = top1_matches / len(reference_tokens)
    top5_accuracy = top5_matches / len(reference_tokens)

    assert top1_accuracy > 0.80  # Bounty requirement
    assert top5_accuracy > 0.95  # Bounty requirement

Phase 5: Performance Optimization

5.1 Measure Baseline Performance

# Run performance test
pytest models/tt_transformers/demo/simple_text_demo.py \
  -k "performance and batch-32" \
  --max_generated_tokens 200

Key metrics:

For Phi-3 on N150:

5.2 Apply Optimizations

Precision tuning:

# Try different precision configurations
pytest models/tt_transformers/demo/simple_text_demo.py \
  -k "performance and batch-32" \
  --optimizations 'precision_cfg = {ff1_3: bfp4, ff2: bfp4, wqkv: bfp8, wo: bfp8}'

Create custom decoder config:

// models/tt_transformers/model_params/Phi-3-mini-128k-instruct/performance_decoder_config.json
{
  "decoder_0": {
    "ff1_3": "bfp4",
    "ff2": "bfp4",
    "wqkv": "bfp8",
    "wo": "bfp8"
  },
  // ... remaining decoders
}
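
Writing 32 near-identical decoder entries by hand is error-prone; a few lines of Python can generate the file instead. A sketch, assuming a 32-layer model and the precision keys shown above:

import json

precision = {"ff1_3": "bfp4", "ff2": "bfp4", "wqkv": "bfp8", "wo": "bfp8"}
config = {f"decoder_{i}": dict(precision) for i in range(32)}  # 32 layers in Phi-3-mini

path = "performance_decoder_config.json"  # place under model_params/<model>/
with open(path, "w") as f:
    json.dump(config, f, indent=2)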

Advanced optimizations:

See: Advanced Performance Optimizations

5.3 Profile and Debug

Use Tracy profiler:

# Build with Tracy support
cmake -B build -DENABLE_TRACY=ON
cmake --build build

# Run with profiling
pytest models/tt_transformers/demo/simple_text_demo.py -k "batch-32"

Analyze bottlenecks:


Phase 6: Testing & CI Integration

6.1 Create Test Suite

Required tests for bounty submission:

# Accuracy test
pytest models/tt_transformers/tests/test_accuracy.py -k "phi3"

# Performance test
pytest models/tt_transformers/tests/test_perf.py -k "phi3"

# Demo test (end-to-end)
pytest models/tt_transformers/demo/simple_text_demo.py -k "phi3"

6.2 Generate Reference Logits

For CI accuracy validation:

# Generate reference outputs for CI
python models/tt_transformers/tests/generate_reference_hf.py \
  --model microsoft/Phi-3-mini-128k-instruct \
  --output reference_outputs/Phi-3-mini-128k-instruct.refpt

6.3 Add CI Configuration

Mark tests for CI execution:

# tests/test_ci_dispatch.py

@pytest.mark.parametrize(
    "model_name",
    ["Llama-3.1-8B-Instruct", "Phi-3-mini-128k-instruct"],  # Add your model
)
def test_model_demo(model_name):
    # CI will run this test on every commit
    ...

Phase 7: Documentation & Submission

7.1 Document Your Work

Create or update README:

# Phi-3-mini-128k-instruct on Tenstorrent Hardware

## Overview
- Model: microsoft/Phi-3-mini-128k-instruct (3.8B parameters)
- Hardware: N150 / N300 / LoudBox
- Performance: 28 tokens/second/user on N150 (58% of theoretical max)

## Installation
```bash
export HF_MODEL=microsoft/Phi-3-mini-128k-instruct
pip install -r models/tt_transformers/requirements.txt
```

## Running
```bash
# Single user
pytest models/tt_transformers/demo/simple_text_demo.py -k "batch-1"

# Batch of 32 users
pytest models/tt_transformers/demo/simple_text_demo.py -k "batch-32"
```

## Performance
| Hardware | Batch Size | Throughput (t/s/u) | TTFT (ms) |
|----------|------------|-------------------|-----------|
| N150     | 1          | 15.2              | 120       |
| N150     | 32         | 28.4              | 180       |
| N300     | 32         | 42.1              | 95        |

## Accuracy
- Top-1: 84.3%
- Top-5: 96.7%
(Tested on 512-token prefill + 511-token generation)

7.2 Submit Pull Request

PR checklist:

PR structure (follow MODEL_ADD.md recommendations):

Option A: Single PR (small changes)

PR #1: Phi-3 model integration
- Core model code
- Unit tests
- Demo test
- Documentation

Option B: Multi-PR (large changes)

PR #1: Phi-3 core model code + component tests
  → Run: post-commit + models nightly

PR #2: Phi-3 performance tests
  → Run: model perf + device perf

PR #3: Phi-3 demo test
  → Run: demo tests

PR description template:

## Summary
Adds support for microsoft/Phi-3-mini-128k-instruct on N150/N300 hardware.

Closes #19416 (bounty issue)

## Changes
- Modified `rope.py` to support LongRoPE (`su`) scaling
- Updated `model_config.py` for Phi-3 detection
- Added batch padding normalization in `common.py`

## Testing
- [x] Unit tests pass (test_phi3_*.py)
- [x] Accuracy test passes (84.3% top-1, 96.7% top-5)
- [x] Performance test passes (28 t/s/u on N150 = 58% theoretical)
- [x] Demo generates coherent text

## Performance
| Device | Throughput | Tier Achieved |
|--------|-----------|---------------|
| N150   | 28 t/s/u  | Medium (58% of theoretical) |

## Accuracy
- Top-1: 84.3% ✅ (>80% required)
- Top-5: 96.7% ✅ (>95% required)

7.3 Respond to Review Feedback

Common reviewer requests:

  1. Make changes more general - Can this work for other models too?
  2. Reduce code duplication - Can you reuse existing functions?
  3. Add test coverage - Missing edge cases?
  4. Fix CI failures - Rebase on latest main

Example from Phi-3 review:

Reviewer: "Does this need to be restricted to Phi-3-mini?"
Response: "Good point! Updated to support all Phi-3 variants (3.5, 4) with long_factor RoPE scaling."

Applying This to Other Bounties

Lesson Applicability

The Phi-3 workflow applies to:

1. ✅ Transformer-Based LLMs

Examples:

Strategy:

2. ✅ Vision Transformers

Examples:

Strategy:

3. ✅ Diffusion Models

Examples:

Strategy:

4. ⚠️ Novel Architectures (Harder)

Examples:

Strategy:


Pro Tips from Successful Contributors

1. Start Small

2. Communicate Early and Often

3. Reuse, Don't Reinvent

4. Test Incrementally

5. Profile Early

6. Document as You Go

7. Break Up Large PRs


Common Pitfalls to Avoid

❌ Don't Copy-Paste Entire Codebases

Why it fails:

Do instead:

❌ Don't Skip Baseline Validation

Why it fails:

Do instead:

❌ Don't Optimize Prematurely

Why it fails:

Do instead:

❌ Don't Ignore CI Failures

Why it fails:

Do instead:


Resources

Official Documentation

Example PRs

Community


Next Steps

Ready to make your first contribution? Try the hands-on example:

🚀 Browse Open Bounties on GitHub

📋 Copy Bounty Workflow Checklist


Summary

You've learned:

The workflow transfers to:

Key principle: Start simple, test incrementally, reuse code, communicate often.


Ready to contribute? 🎯

The Tenstorrent community is welcoming to newcomers. Start with a warmup task, learn the workflow, then scale up to more challenging contributions. Your work will run on cutting-edge AI hardware, become part of the open-source ecosystem, and help advance the field. The real reward is owning a piece of an open future you helped build.