
Model Architecture Basics

Understand the building blocks of transformer models before training from scratch. This conceptual lesson prepares you for CT-8.

What You'll Learn

Time: 20 minutes | Prerequisites: CT-1 through CT-6


Why Learn Architecture?

You've Fine-Tuned, Now What?

In CT-4, you fine-tuned TinyLlama without thinking about its internals. That works for most use cases!

But to train from scratch (CT-8), you need to understand what each component does (tokenizer, embeddings, attention, feed-forward layers, normalization, output layer) and how sizing choices like hidden_dim and num_layers drive memory use and training time.

This lesson is your architecture primer.


The Transformer Architecture (High Level)

Input → Output Flow

graph TD
    A[Text Input: Hello world] --> B[Tokenization]
    B --> C[Token IDs: 15496, 1917]
    C --> D[Embedding Layer]
    D --> E[Add Positional Encoding]
    E --> F[Transformer Block 1]
    F --> G[Transformer Block 2]
    G --> H[... N blocks ...]
    H --> I[Transformer Block N]
    I --> J[Output Layer]
    J --> K[Next Token Probabilities]
    K --> L[Detokenization]
    L --> M[Text Output]

    style F fill:#e1f5ff,stroke:#333,stroke-width:2px
    style G fill:#e1f5ff,stroke:#333,stroke-width:2px
    style I fill:#e1f5ff,stroke:#333,stroke-width:2px

Key insight: Most of the "magic" happens in the transformer blocks, repeated N times.

Inside a Transformer Block

Each transformer block contains:

graph TD
    A[Input from Previous Block<br/>or Embeddings] --> B[RMSNorm 1<br/>Normalize]
    B --> C[Multi-Head<br/>Self-Attention<br/>Context awareness]
    C --> D[Residual Connection<br/>Add input]
    D --> E[RMSNorm 2<br/>Normalize]
    E --> F[Feed-Forward<br/>Network<br/>Process individually]
    F --> G[Residual Connection<br/>Add pre-FFN state]
    G --> H[Output to Next Block<br/>or Output Layer]

    I[Skip Connections] -.-> D
    I -.-> G

    style A fill:#4A90E2,stroke:#333,stroke-width:2px
    style C fill:#7B68EE,stroke:#333,stroke-width:2px
    style F fill:#50C878,stroke:#333,stroke-width:2px
    style H fill:#4A90E2,stroke:#333,stroke-width:2px
    style D fill:#E85D75,stroke:#333,stroke-width:2px
    style G fill:#E85D75,stroke:#333,stroke-width:2px

Key components:

  1. RMSNorm - Stabilize values
  2. Multi-Head Attention - Learn context
  3. Residual Connections - Enable deep networks (prevent vanishing gradients)
  4. Feed-Forward Network - Transform representations

This block repeats N times (6 for nano-trickster, 22 for TinyLlama).
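
To make the flow concrete, here is a minimal sketch of one pre-norm block, using PyTorch for illustration. The class name and dimensions are hypothetical, and LayerNorm/GELU stand in for the RMSNorm/SwiGLU used by modern models; this is a sketch of the structure, not the exact CT-8 implementation.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    # Pre-norm block: norm -> attention -> residual, then norm -> FFN -> residual.
    def __init__(self, hidden_dim=256, num_heads=8, mlp_dim=768):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden_dim)   # modern models swap in RMSNorm
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(hidden_dim)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, mlp_dim),
            nn.GELU(),                          # stand-in for SwiGLU
            nn.Linear(mlp_dim, hidden_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)        # self-attention: Q = K = V
        x = x + attn_out                        # residual connection 1
        x = x + self.ffn(self.norm2(x))         # residual connection 2
        return x

block = TransformerBlock()
print(block(torch.randn(1, 16, 256)).shape)     # torch.Size([1, 16, 256])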


Component 1: Tokenization

What Is a Token?

Token: A piece of text the model can process.

Options:

  1. Character-level: Each character is a token

    • "Hello" → ['H', 'e', 'l', 'l', 'o']
    • Pros: Small vocabulary (26 letters + punctuation)
    • Cons: Long sequences (every character counts)
  2. Word-level: Each word is a token

    • "Hello world" → ['Hello', 'world']
    • Pros: Meaningful units
    • Cons: Huge vocabulary (every word needs an ID)
  3. Subword (BPE/WordPiece): Hybrid approach

    • "unbelievable" → ['un', 'believ', 'able']
    • Pros: Balance vocabulary size and sequence length
    • Cons: More complex to train

TinyLlama uses BPE (Byte-Pair Encoding): 32,000 token vocabulary.
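
As a concrete example, here is a minimal character/byte-level tokenizer sketch — the approach nano-trickster takes in CT-8. Plain Python, no tokenizer library assumed; a BPE tokenizer would need training and a much larger vocabulary.

# Minimal byte-level tokenizer: each byte (0-255) is one token ID,
# giving a fixed vocabulary of 256 with nothing to train.
def encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))

def decode(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8", errors="replace")

ids = encode("Hello world")
print(ids)          # [72, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
print(decode(ids))  # "Hello world"
# A subword (BPE) tokenizer trades a bigger vocabulary (32,000 for TinyLlama)
# for far fewer tokens on the same text.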

Why It Matters for Training

Vocabulary size sets the size of the first (embedding) layer and the final (output) layer: every one of the vocab_size tokens needs its own hidden_dim-sized vector.

Trade-off: a larger vocabulary shortens sequences (fewer tokens per text) but inflates the embedding and output layers; a smaller vocabulary keeps those layers tiny at the cost of longer sequences.


Component 2: Embeddings

What Is an Embedding?

Embedding: Convert token IDs (integers) to dense vectors (floats).

Token ID: 1234
    ↓
Embedding Layer (lookup table)
    ↓
Vector: [0.23, -0.45, 0.12, ..., 0.67]  # size = hidden_dim

Example: TinyLlama's embedding table is 32,000 tokens × 2048 dimensions ≈ 65.5M parameters.

This is often the largest single layer!
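
A minimal sketch of the lookup, using PyTorch's nn.Embedding for illustration with nano-trickster-sized numbers (the variable names are just examples):

import torch
import torch.nn as nn

vocab_size, hidden_dim = 256, 256                      # nano-trickster-sized
embedding = nn.Embedding(vocab_size, hidden_dim)       # a 256 × 256 lookup table

token_ids = torch.tensor([[72, 101, 108, 108, 111]])   # "Hello" as byte IDs
vectors = embedding(token_ids)
print(vectors.shape)                                   # torch.Size([1, 5, 256])
print(embedding.weight.numel())                        # 65,536 params = vocab × hidden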

Token Embeddings vs Position Embeddings

Token embedding: What is the token?

Position embedding: Where is the token?

Combined: token_embedding + position_embedding

This tells the model both what the word is and where it appears.


Component 3: Self-Attention

The Core Idea

Self-Attention: Let each word look at every other word to understand context.

Example:

Sentence: "The cat sat on the mat"

When processing "sat":
- Look at "The" → not very relevant (weight: 0.1)
- Look at "cat" → very relevant! (weight: 0.9)
- Look at "on" → somewhat relevant (weight: 0.3)
- Look at "mat" → relevant (weight: 0.5)

graph LR
    subgraph "Self-Attention for 'sat'"
        A[The<br/>weight: 0.1] -.-> E[sat]
        B[cat<br/>weight: 0.9] ==> E
        C[on<br/>weight: 0.3] --> E
        D[mat<br/>weight: 0.5] --> E
        E --> F[Context-aware<br/>'sat' embedding]
    end

    style B fill:#50C878,stroke:#333,stroke-width:2px
    style E fill:#4A90E2,stroke:#333,stroke-width:2px
    style F fill:#7B68EE,stroke:#333,stroke-width:2px

The model learns these weights during training.

Query, Key, Value (QKV)

Think of it like a search engine:

  1. Query: What am I looking for?

    • "sat" asks: "What's the subject?"
  2. Key: What can I offer?

    • "cat" says: "I'm a noun, I can be a subject!"
  3. Value: What information do I have?

    • "cat" provides its semantic meaning

Math (simplified):

attention_weight = softmax(Query · Key^T / √d_k)   # d_k = size of each Key vector
output = attention_weight · Value

graph TD
    A[Input Word Embedding] --> B1[Query Matrix W_Q]
    A --> B2[Key Matrix W_K]
    A --> B3[Value Matrix W_V]

    B1 --> C1[Query Vector]
    B2 --> C2[Key Vector]
    B3 --> C3[Value Vector]

    C1 & C2 --> D[Compute Attention Scores<br/>Query · Key^T]
    D --> E[Softmax<br/>Get Attention Weights]
    E & C3 --> F[Weighted Sum<br/>Attention × Value]
    F --> G[Context-Aware Output]

    style A fill:#4A90E2,stroke:#333,stroke-width:2px
    style G fill:#7B68EE,stroke:#333,stroke-width:2px
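Here is a single-head, unmasked version of that computation as a PyTorch sketch. Random matrices stand in for the learned W_Q/W_K/W_V, and a real decoder would also apply a causal mask — this only shows the shapes and the softmax-weighted sum.

import torch
import torch.nn.functional as F

hidden_dim = 256
x = torch.randn(6, hidden_dim)               # 6 tokens: "The cat sat on the mat"

W_q = torch.randn(hidden_dim, hidden_dim)    # learned in a real model
W_k = torch.randn(hidden_dim, hidden_dim)
W_v = torch.randn(hidden_dim, hidden_dim)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / hidden_dim ** 0.5         # scaled dot-product
weights = F.softmax(scores, dim=-1)          # one attention row per token
output = weights @ V                         # context-aware token vectors
print(weights.shape, output.shape)           # torch.Size([6, 6]) torch.Size([6, 256])
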

Parameters: W_Q, W_K, and W_V are each hidden_dim × hidden_dim (≈ 65K weights apiece at hidden_dim = 256); together with the output projection introduced below, attention costs about 4 × hidden_dim² weights.

Multi-Head Attention

Instead of one attention mechanism, use multiple in parallel:

graph TD
    A[Input Embedding<br/>hidden_dim = 256] --> B[Split into 8 Heads<br/>32 dims each]

    B --> H1[Head 1<br/>Syntax patterns<br/>Q/K/V: 32×32]
    B --> H2[Head 2<br/>Semantic relations<br/>Q/K/V: 32×32]
    B --> H3[Head 3<br/>Long-range deps<br/>Q/K/V: 32×32]
    B --> H4[Heads 4-8<br/>Other patterns<br/>Q/K/V: 32×32]

    H1 --> C[Concatenate Results<br/>8 heads × 32 = 256]
    H2 --> C
    H3 --> C
    H4 --> C

    C --> D[Output Projection<br/>256 → 256]
    D --> E[Context-Rich Embedding]
    D --> E[Context-Rich Embedding]

    style A fill:#4A90E2,stroke:#333,stroke-width:2px
    style E fill:#7B68EE,stroke:#333,stroke-width:2px
    style H1 fill:#50C878,stroke:#333,stroke-width:2px
    style H2 fill:#50C878,stroke:#333,stroke-width:2px
    style H3 fill:#50C878,stroke:#333,stroke-width:2px
    style H4 fill:#50C878,stroke:#333,stroke-width:2px
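
The "split into heads" step is just a reshape. A small PyTorch sketch showing the shapes only (no attention math; dimensions match the diagram above):

import torch

hidden_dim, num_heads = 256, 8
head_dim = hidden_dim // num_heads            # 32 dims per head

x = torch.randn(1, 10, hidden_dim)            # (batch, seq_len, hidden)
heads = x.view(1, 10, num_heads, head_dim).transpose(1, 2)
print(heads.shape)                            # torch.Size([1, 8, 10, 32])

# Each head runs attention on its own 32-dim slice, then the results are
# concatenated back to (1, 10, 256) and passed through the output projection.
merged = heads.transpose(1, 2).reshape(1, 10, hidden_dim)
print(merged.shape)                           # torch.Size([1, 10, 256])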

Why multiple heads? Each head can attend to a different kind of pattern (syntax, semantics, long-range dependencies), and because the heads split the existing hidden dimensions, running eight of them costs no more than running one big one.

Parameters: multi-head attention adds no extra weights over single-head — the same ~4 × hidden_dim² attention parameters are simply divided into 8 slices of 32 dimensions each (for hidden_dim = 256).


Component 4: Feed-Forward Networks

What Does It Do?

After attention tells us which words matter, the feed-forward network processes each word individually.

Structure:

graph TD
    A[Input from Attention<br/>hidden_dim = 256] --> B[Linear Layer 1<br/>256 → 1024<br/>262K params]
    B --> C[Activation Function<br/>SwiGLU or ReLU<br/>Non-linearity]
    C --> D[Linear Layer 2<br/>1024 → 256<br/>262K params]
    D --> E[Output<br/>hidden_dim = 256]

    F[Parameter Breakdown] --> G[Layer 1: 256 × 1024 = 262K]
    F --> H[Layer 2: 1024 × 256 = 262K]
    F --> I[Total: 524K params per FFN]

    style A fill:#4A90E2,stroke:#333,stroke-width:2px
    style E fill:#7B68EE,stroke:#333,stroke-width:2px
    style C fill:#E85D75,stroke:#333,stroke-width:2px
    style G fill:#6C757D,stroke:#333,stroke-width:2px
    style H fill:#6C757D,stroke:#333,stroke-width:2px
    style I fill:#50C878,stroke:#333,stroke-width:2px
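
A minimal PyTorch sketch of the FFN above, using GELU as a simple stand-in for SwiGLU, mainly to confirm the parameter count from the diagram:

import torch.nn as nn

hidden_dim, mlp_dim = 256, 1024
ffn = nn.Sequential(
    nn.Linear(hidden_dim, mlp_dim, bias=False),   # 256 → 1024: 262,144 weights
    nn.GELU(),                                    # non-linearity (SwiGLU in Llama-style models)
    nn.Linear(mlp_dim, hidden_dim, bias=False),   # 1024 → 256: 262,144 weights
)
print(sum(p.numel() for p in ffn.parameters()))   # 524,288 params per FFN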

Typical sizing: mlp_dim is usually 2-4× hidden_dim — 1024 (4×) in the 256-dim example above, 5632 (≈ 2.75×) in TinyLlama.

Parameters: two weight matrices of hidden_dim × mlp_dim each, so roughly 2 × hidden_dim × mlp_dim per FFN (524K in the example above).

Why It Matters

Feed-forward networks are where most parameters live in large models: in the TinyLlama breakdown later in this lesson, the FFN is the biggest slice of every transformer block.

Trade-off: raising mlp_dim adds capacity fastest, but it is also the most expensive knob, because FFN weights dominate both parameter count and training compute.


Component 5: Normalization

Why Normalize?

Problem: As you stack layers, activations can explode or vanish.

Solution: Normalize after each sub-layer.

Two common approaches:

  1. LayerNorm (older models like GPT-2):

    normalized = (x - mean) / std
    
  2. RMSNorm (modern models like TinyLlama):

    normalized = x / rms(x)
    

RMSNorm is faster and works just as well.
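
A bare-bones RMSNorm sketch (real implementations also multiply by a learned per-dimension scale, omitted here for brevity):

import torch

def rms_norm(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Scale each vector by its root mean square; no mean subtraction,
    # which is what makes RMSNorm cheaper than LayerNorm.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms

x = torch.randn(4, 256)
print(rms_norm(x).pow(2).mean(dim=-1))   # ≈ 1.0 for every row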

Parameters: each norm layer has only a single learned scale vector of length hidden_dim (LayerNorm adds a bias as well), so normalization contributes a negligible share of the total.


Component 6: Output Layer

From Hidden States to Predictions

Final step: Convert hidden vectors back to token probabilities.

Hidden state: [0.23, -0.45, ..., 0.67]  # size = hidden_dim
    ↓
Linear layer (hidden_dim → vocab_size)
    ↓
Softmax
    ↓
Probabilities: [0.01, 0.02, ..., 0.85]  # size = vocab_size

Parameters: hidden_dim × vocab_size — for TinyLlama, 2048 × 32,000 ≈ 65.5M, the same size as the embedding table.

Often ties weights with the embedding layer to save parameters: the output projection simply reuses the embedding matrix (transposed), so those 65.5M weights are stored only once.
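
A small PyTorch sketch of weight tying (illustrative names, nano-trickster-sized dimensions):

import torch.nn as nn

vocab_size, hidden_dim = 256, 256
embedding = nn.Embedding(vocab_size, hidden_dim)
lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)

# Weight tying: reuse the embedding matrix as the output projection,
# saving vocab_size × hidden_dim parameters.
lm_head.weight = embedding.weight
print(lm_head.weight is embedding.weight)   # True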


Putting It All Together: TinyLlama

Architecture Summary

TinyLlama-1.1B:
  vocab_size: 32,000
  hidden_dim: 2048
  num_layers: 22
  num_heads: 32
  mlp_dim: 5632  # ~2.75 × hidden_dim
  max_seq_len: 2048

Parameter Breakdown

Per transformer block: ≈ 40M parameters — ≈ 16.8M in attention, ≈ 23.1M in the feed-forward network, and a few thousand in the norms.

Full model: ≈ 65.5M for the embeddings plus 22 × 40M ≈ 880M across the blocks, with the output layer tied to the embeddings — on the order of 1B parameters (TinyLlama's gated FFN pushes the real count to the advertised 1.1B).

graph TD
    A[TinyLlama-1.1B<br/>Parameter Distribution] --> B[Embedding Layer<br/>65.5M<br/>6%]
    A --> C[22 Transformer Blocks<br/>880M total<br/>80%]
    A --> D[Output Layer<br/>65.5M shared<br/>6%]

    C --> E[Per Block: 40M params]
    E --> F[Multi-Head Attention<br/>16.8M<br/>42% of block]
    E --> G[Feed-Forward Network<br/>23.1M<br/>58% of block]
    E --> H[Normalization<br/>~4K<br/>negligible]

    style A fill:#4A90E2,stroke:#333,stroke-width:2px
    style B fill:#7B68EE,stroke:#333,stroke-width:2px
    style C fill:#50C878,stroke:#333,stroke-width:2px
    style D fill:#7B68EE,stroke:#333,stroke-width:2px
    style F fill:#E85D75,stroke:#333,stroke-width:2px
    style G fill:#DDA0DD,stroke:#333,stroke-width:2px

Key insight: the feed-forward networks hold the largest share of parameters — roughly 58% of each block in this simplified breakdown, and closer to 70% of the real model once its gated FFN is counted.
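
You can reproduce the breakdown with a back-of-the-envelope counter. The sketch below assumes plain multi-head attention and a two-matrix FFN, so it lands a bit under TinyLlama's advertised 1.1B (the real model's gated FFN adds a third matrix per block); the function name and arguments are just illustrative.

# Simplified parameter counter matching the diagram above.
def count_params(vocab, hidden, layers, mlp, tie_output=True):
    embed = vocab * hidden                   # embedding table
    attn = 4 * hidden * hidden               # W_Q, W_K, W_V, W_O
    ffn = 2 * hidden * mlp                   # up + down projections
    block = attn + ffn                       # norms are negligible
    output = 0 if tie_output else vocab * hidden
    return embed + layers * block + output

print(count_params(32_000, 2048, 22, 5632))  # ≈ 0.94B with these simplifications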

Why This Matters

For fine-tuning (CT-4): the architecture is fixed — you only nudge existing weights, so these numbers mainly determine how much memory and time fine-tuning needs.

For training from scratch (CT-8): you choose every one of these numbers yourself, and each choice directly sets the model's capacity, memory footprint, and training time.


Designing Your Own Architecture

The Scaling Laws

Rule of thumb for compute:

Training cost ∝ (num_params) × (num_tokens), plus an attention term that grows with (context_length)²

Trade-offs:

| Parameter  | Effect if increased        | Cost if increased         |
|------------|----------------------------|---------------------------|
| hidden_dim | More expressive embeddings | All layers bigger         |
| num_layers | Deeper understanding       | Linear scaling            |
| num_heads  | Richer attention patterns  | Minimal (heads are split) |
| mlp_dim    | More capacity per layer    | Significant (most params) |
| vocab_size | Better tokenization        | Bigger embedding/output   |

Example: Nano-Trickster (CT-8)

Goal: Build a 10-20M parameter model for N150.

Design:

nano-trickster:
  vocab_size: 256        # Character-level (simple!)
  hidden_dim: 256        # Small but workable
  num_layers: 6          # Shallow (6× faster than TinyLlama)
  num_heads: 8           # Decent parallelism
  mlp_dim: 768           # 3× hidden_dim
  max_seq_len: 512       # Short context (fine for our task)

Parameter count: roughly 11M with this configuration (see the comparison below) — small enough to train on an N150 in 30-60 minutes.

graph LR
    A[Model Size Comparison] --> B[Nano-Trickster<br/>11M params]
    A --> C[TinyLlama<br/>1.1B params]

    B --> B1[vocab: 256<br/>char-level]
    B --> B2[hidden: 256<br/>small]
    B --> B3[layers: 6<br/>shallow]
    B --> B4[Training: 30-60 min<br/>N150 ✓]

    C --> C1[vocab: 32,000<br/>BPE]
    C --> C2[hidden: 2048<br/>large]
    C --> C3[layers: 22<br/>deep]
    C --> C4[Training: Many hours<br/>N300+ recommended]

    style B fill:#7B68EE,stroke:#333,stroke-width:2px
    style C fill:#4A90E2,stroke:#333,stroke-width:2px
    style B4 fill:#50C878,stroke:#333,stroke-width:2px
    style C4 fill:#E85D75,stroke:#333,stroke-width:2px

Why this works: the character-level vocabulary keeps the embedding and output layers tiny, six narrow layers keep step time and memory low, and the 512-token context keeps attention cheap — so the whole model fits and trains comfortably on an N150.


Memory and Compute Considerations

Memory Requirements

Model size (inference):

memory = num_params × bytes_per_param

For BF16 (2 bytes): 1.1B params = 2.2GB

Training memory (much higher):

memory = num_params × (
    2 bytes (model weights) +
    2 bytes (gradients) +
    8 bytes (optimizer state, e.g., AdamW) +
    4 bytes (activations per layer per token)
)

For 1.1B params + batch_size=8 + seq_len=512:

graph LR
    A[Training Memory<br/>for 1.1B params] --> B[Model Weights<br/>2.2GB<br/>BF16 format]
    A --> C[Gradients<br/>2.2GB<br/>same size as weights]
    A --> D[Optimizer State<br/>8.8GB<br/>AdamW momentum]
    A --> E[Activations<br/>4GB<br/>batch × layers]

    F[Total: ~17GB] --> G[N150: Tight<br/>DRAM limits]
    F --> H[N300: Comfortable<br/>Distributed memory]
    F --> I[Nano-model 10-20M: Easy<br/>~200MB total]

    style A fill:#4A90E2,stroke:#333,stroke-width:2px
    style D fill:#E85D75,stroke:#333,stroke-width:2px
    style G fill:#FF6B6B,stroke:#333,stroke-width:2px
    style H fill:#50C878,stroke:#333,stroke-width:2px
    style I fill:#7B68EE,stroke:#333,stroke-width:2px
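
A quick calculator for the same estimate. The activation term really depends on batch size, sequence length, and layer count; here the lesson's 4GB example is passed in as a fixed number, and the function name is just illustrative.

# Rough training-memory estimate following the formula above.
def training_memory_gb(num_params, activation_gb=4.0):
    weights = num_params * 2          # BF16 weights
    grads = num_params * 2            # BF16 gradients
    optimizer = num_params * 8        # AdamW moment estimates (FP32 m and v)
    return (weights + grads + optimizer) / 1e9 + activation_gb

print(training_memory_gb(1.1e9))      # ≈ 17 GB for TinyLlama-sized training
print(training_memory_gb(15e6, 0.1))  # well under 1 GB for a ~15M nano-model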

This is why: full training of a 1.1B-parameter model is tight on a single N150 and more comfortable on an N300, while CT-8's 10-20M-parameter nano-model needs only ~200MB and trains easily on either.

Compute Bottlenecks

Where time is spent during training:

  1. Attention: ~30% (sequence_length² operations)
  2. Feed-forward: ~60% (matrix multiplications)
  3. Other: ~10% (normalization, activations, etc.)

pie title Training Time Distribution
    "Feed-Forward Networks" : 60
    "Attention Mechanisms" : 30
    "Other (Norm, Activation)" : 10

graph TD
    A[Scaling Impacts] --> B[Double Sequence Length<br/>seq_len: 512 → 1024]
    B --> B1[Attention Cost: 4×<br/>quadratic scaling]
    B --> B2[FFN Cost: Same<br/>no sequence dependency]

    A --> C[Double Hidden Dimension<br/>hidden_dim: 256 → 512]
    C --> C1[Attention Cost: 4×<br/>QKV matrices scale]
    C --> C2[FFN Cost: 4×<br/>matrix sizes scale]

    A --> D[Double Num Layers<br/>num_layers: 6 → 12]
    D --> D1[All Costs: 2×<br/>linear scaling]

    style B1 fill:#E85D75,stroke:#333,stroke-width:2px
    style C1 fill:#E85D75,stroke:#333,stroke-width:2px
    style C2 fill:#E85D75,stroke:#333,stroke-width:2px
    style D1 fill:#50C878,stroke:#333,stroke-width:2px

Scaling considerations: doubling the sequence length quadruples attention cost but leaves the FFN unchanged; doubling hidden_dim roughly quadruples both attention and FFN cost; doubling the number of layers simply doubles everything.


Key Architectural Innovations

Why Modern Models Use These

RoPE (Rotary Position Embeddings): encodes position by rotating the query and key vectors instead of adding a separate position vector, which generalizes better to longer sequences.

SwiGLU (Gated Linear Units): replaces the plain ReLU feed-forward with a gated variant that improves quality at a similar parameter budget.

RMSNorm: drops LayerNorm's mean subtraction and bias, normalizing by the root mean square alone — slightly cheaper, equally effective.

Multi-Query Attention (MQA) / Grouped-Query Attention (GQA): shares key/value projections across heads (fully in MQA, in groups for GQA), shrinking the KV cache and speeding up inference.


Practical Implications for Training

From CT-4 (Fine-tuning) to CT-8 (From Scratch)

Fine-tuning (what you did in CT-4):

# Load pre-trained model
model = load_pretrained("TinyLlama-1.1B")

# All architecture decisions already made:
# - 22 layers
# - 2048 hidden_dim
# - 32 attention heads
# - etc.

# Just adjust weights
train(model, your_dataset)

Training from scratch (CT-8):

# YOU decide the architecture
model = TransformerModel(
    vocab_size=256,      # Your choice!
    hidden_dim=256,      # Your choice!
    num_layers=6,        # Your choice!
    num_heads=8,         # Your choice!
    mlp_dim=768,         # Your choice!
)

# Initialize weights randomly
model.init_weights()

# Train from zero
train(model, your_dataset)

Key difference: You control every architectural decision.

graph TD
    A[Training Approaches] --> B[Fine-Tuning<br/>CT-4]
    A --> C[From Scratch<br/>CT-8]

    B --> B1[Start: Pre-trained Model<br/>TinyLlama 1.1B<br/>Already knows language]
    B1 --> B2[Architecture: Fixed<br/>22 layers, 2048 hidden<br/>Can't change structure]
    B2 --> B3[Training: Fast<br/>500-1000 steps<br/>1-3 hours on N150]
    B3 --> B4[Result: Specialized<br/>Keeps general knowledge<br/>Adds new behavior]

    C --> C1[Start: Random Weights<br/>Blank slate<br/>Knows nothing]
    C1 --> C2[Architecture: Your Choice<br/>6 layers, 256 hidden<br/>You design everything]
    C2 --> C3[Training: Longer<br/>5000-10000 steps<br/>Many hours]
    C3 --> C4[Result: Custom<br/>Learns from data only<br/>Tailored to task]

    style B fill:#7B68EE,stroke:#333,stroke-width:2px
    style C fill:#4A90E2,stroke:#333,stroke-width:2px
    style B3 fill:#50C878,stroke:#333,stroke-width:2px
    style C3 fill:#E85D75,stroke:#333,stroke-width:2px

Common Architecture Mistakes

❌ Don't: Make Everything Big

# This will OOM on N150 and train forever
bad-design:
  hidden_dim: 4096    # Too big!
  num_layers: 24      # Too many!
  mlp_dim: 16384      # Way too big!
  # Result: 2B+ parameters

✅ Do: Start Small, Scale Up

# This will work on N150
good-design:
  hidden_dim: 256     # Reasonable
  num_layers: 6       # Manageable
  mlp_dim: 768        # 3× hidden_dim
  # Result: ~11M parameters

❌ Don't: Use Incompatible Dimensions

bad-design:
  hidden_dim: 256
  num_heads: 7        # Not a divisor of 256!
  # Error: hidden_dim must be divisible by num_heads

✅ Do: Keep Dimensions Compatible

good-design:
  hidden_dim: 256
  num_heads: 8        # 256 / 8 = 32 (perfect!)
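
A few lines of Python catch these mistakes before you spend any training time. The config dict below is hypothetical, mirroring the YAML above:

# Quick sanity checks on an architecture config before training.
config = {"hidden_dim": 256, "num_heads": 8, "mlp_dim": 768}

assert config["hidden_dim"] % config["num_heads"] == 0, \
    "hidden_dim must be divisible by num_heads"
head_dim = config["hidden_dim"] // config["num_heads"]
print(f"head_dim = {head_dim}")                     # 32

assert 2 <= config["mlp_dim"] / config["hidden_dim"] <= 4, \
    "mlp_dim is usually 2-4x hidden_dim"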

Architecture Cheat Sheet

For Quick Reference

| Component            | Typical Range | Nano-Trickster (CT-8) | TinyLlama    |
|----------------------|---------------|-----------------------|--------------|
| vocab_size           | 256-50,000    | 256 (char-level)      | 32,000 (BPE) |
| hidden_dim           | 128-4096      | 256                   | 2048         |
| num_layers           | 4-32          | 6                     | 22           |
| num_heads            | 4-32          | 8                     | 32           |
| mlp_dim              | 2-4× hidden   | 768 (3×)              | 5632 (2.75×) |
| max_seq_len          | 128-4096      | 512                   | 2048         |
| Total params         | -             | ~11M                  | ~1.1B        |
| Training time (N150) | -             | 30-60 min             | Many hours   |

Beyond This Lesson: Architecting the Future of AI

You've learned how transformers work under the hood. But what can you build with this architectural knowledge? Let's explore how understanding architecture unlocks the ability to design specialized models that solve real problems.

What Developers Have Designed

Real architectures built by developers who understood the fundamentals:

🎯 "Code Completion Specialist" (Startup engineer)

🔬 "Protein Sequence Analyzer" (Biotech researcher)

🎮 "Game Dialogue Generator" (Indie studio)

💼 "Legal Document Parser" (LegalTech company)

Specialized Architectures Beat General Models

Why architectural choices matter more than you think:

📊 Medical Q&A Model (100M params, specialized)

🔧 Hardware Verilog Generator (20M params)

📝 Meeting Notes Summarizer (40M params)

Architectural Patterns to Learn From

Design patterns that solve real problems:

🚀 Tiny Transformers (1-50M params) When to use:

Architecture choices:

🎯 Long-Context Transformers (50-500M params) When to use:

Architecture choices:

🔬 Domain-Specific Transformers (20-200M params) When to use:

Architecture choices:

💡 Efficient Inference Transformers (10-100M params) When to use:

Architecture choices:

Your Architecture Design Journey

From understanding to creation:

Week 1 (Understanding - this lesson):

Week 2 (Experimentation - CT-8):

Month 2 (Specialization):

Month 3+ (Innovation):

Architectural Decisions That Changed Everything

Real examples of how architectural choices enable breakthroughs:

🌟 Rotary Position Embeddings (RoPE)

Grouped-Query Attention (GQA)

🎯 SwiGLU Activation

🔧 RMSNorm vs LayerNorm

Imagine: Models You Could Design

With your architectural knowledge, you could build:

🚀 Real-Time Code Autocomplete (5M params)

📊 Financial Report Analyzer (30M params)

🎨 Style Transfer Text Rewriter (15M params)

🔬 Scientific Paper Summarizer (50M params)

🎮 Game Narrative Generator (20M params)

The Architecture Decision Tree

How to design your model:

graph TD
    A[What's your primary constraint?] --> B{Latency}
    A --> C{Memory}
    A --> D{Accuracy}

    B --> E[Tiny model<br/>4 layers, 128-256 hidden<br/>1-10M params]
    C --> F[Efficient model<br/>6 layers, 384 hidden<br/>10-50M params]
    D --> G[Larger model<br/>12+ layers, 768+ hidden<br/>100M+ params]

    E --> H[Real-time applications<br/>autocomplete, suggestions]
    F --> I[Production deployment<br/>APIs, mobile apps]
    G --> J[High-accuracy tasks<br/>research, analysis]

    K[What's your data?] --> L{Lots of data}
    K --> M{Limited data}

    L --> N[Train from scratch<br/>Custom architecture]
    M --> O[Fine-tune existing<br/>Adapt architecture]

    style E fill:#50C878,stroke:#333,stroke-width:2px
    style F fill:#7B68EE,stroke:#333,stroke-width:2px
    style G fill:#E85D75,stroke:#333,stroke-width:2px

Your design process:

  1. Define constraints (latency, memory, accuracy requirements)
  2. Choose base architecture (decoder-only, encoder-decoder, etc.)
  3. Size the model (layers, hidden_dim based on task complexity)
  4. Select components (attention type, activation, normalization)
  5. Iterate (train small, evaluate, adjust)

From CT-7 to CT-8: Your Design in Action

What you'll do in the next lesson:

The progression:

You now have:

The question isn't "Can I design a custom architecture?"

The question is "What specialized model will I design first?"

Imagine:

Architecture isn't just theory. It's power to build exactly what you need.


Key Takeaways

Transformers have 6 key components: tokenization, embeddings, attention, FFN, normalization, output

Most parameters live in FFN layers (60-70% of total)

Architecture decisions affect training time and memory significantly

Start small (10-20M params), scale up when you understand the trade-offs

Modern improvements (RoPE, SwiGLU, RMSNorm) make models more efficient

hidden_dim and num_layers are your main scaling knobs


Next Steps

Lesson CT-8: Training from Scratch

You now understand the components. In CT-8, you'll:

  1. Design a nano-trickster architecture (10-20M params)
  2. Initialize it from scratch
  3. Train on tiny-shakespeare dataset
  4. See a model learn language from random initialization
  5. Compare to random baseline (prove learning happened!)

Estimated time: 30 minutes (setup) + 30-60 minutes (training)
Prerequisites: CT-7 (this lesson)


Additional Resources

Papers

Interactive Visualizations

Code References


Ready to build your first model from scratch? Continue to Lesson CT-8: Training from Scratch