Model Architecture Basics
Understand the building blocks of transformer models before training from scratch. This conceptual lesson prepares you for CT-8.
What You'll Learn
- Transformer architecture components
- Tokenization (character vs BPE vs WordPiece)
- Embedding layers and positional encoding
- Attention mechanisms (self-attention, multi-head)
- Feed-forward networks
- Why these components matter for training
Time: 20 minutes | Prerequisites: CT-1 through CT-6
Why Learn Architecture?
You've Fine-Tuned, Now What?
In CT-4, you fine-tuned TinyLlama without thinking about its internals. That works for most use cases!
But to train from scratch (CT-8), you need to understand:
- What components make up a transformer
- How many parameters each component adds
- Where memory and compute are spent
- How to design a small model that fits on your hardware
This lesson is your architecture primer.
The Transformer Architecture (High Level)
Input → Output Flow
graph TD
A[Text Input: Hello world] --> B[Tokenization]
B --> C[Token IDs: 15496, 1917]
C --> D[Embedding Layer]
D --> E[Add Positional Encoding]
E --> F[Transformer Block 1]
F --> G[Transformer Block 2]
G --> H[... N blocks ...]
H --> I[Transformer Block N]
I --> J[Output Layer]
J --> K[Next Token Probabilities]
K --> L[Detokenization]
L --> M[Text Output]
style F fill:#e1f5ff,stroke:#333,stroke-width:2px
style G fill:#e1f5ff,stroke:#333,stroke-width:2px
style I fill:#e1f5ff,stroke:#333,stroke-width:2px
Key insight: Most of the "magic" happens in the transformer blocks, repeated N times.
Inside a Transformer Block
Each transformer block contains:
graph TD
A[Input from Previous Blockor Embeddings] --> B[RMSNorm 1Normalize]
B --> C[Multi-HeadSelf-AttentionContext awareness]
C --> D[Residual ConnectionAdd input]
D --> E[RMSNorm 2Normalize]
E --> F[Feed-ForwardNetworkProcess individually]
F --> G[Residual ConnectionAdd pre-FFN state]
G --> H[Output to Next Blockor Output Layer]
I[Skip Connections] -.-> D
I -.-> G
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style C fill:#7B68EE,stroke:#333,stroke-width:2px
style F fill:#50C878,stroke:#333,stroke-width:2px
style H fill:#4A90E2,stroke:#333,stroke-width:2px
style D fill:#E85D75,stroke:#333,stroke-width:2px
style G fill:#E85D75,stroke:#333,stroke-width:2px
Key components:
- RMSNorm - Stabilize values
- Multi-Head Attention - Learn context
- Residual Connections - Enable deep networks (prevent vanishing gradients)
- Feed-Forward Network - Transform representations
This block repeats N times (6 for nano-trickster, 22 for TinyLlama).
Component 1: Tokenization
What Is a Token?
Token: A piece of text the model can process.
Options:
Character-level: Each character is a token
"Hello"→['H', 'e', 'l', 'l', 'o']- Pros: Small vocabulary (26 letters + punctuation)
- Cons: Long sequences (every character counts)
Word-level: Each word is a token
"Hello world"→['Hello', 'world']- Pros: Meaningful units
- Cons: Huge vocabulary (every word needs an ID)
Subword (BPE/WordPiece): Hybrid approach
"unbelievable"→['un', 'believ', 'able']- Pros: Balance vocabulary size and sequence length
- Cons: More complex to train
TinyLlama uses BPE (Byte-Pair Encoding): 32,000 token vocabulary.
Why It Matters for Training
Vocabulary size = first layer size:
- 32,000 vocab = 32,000 × hidden_dim parameters in embedding layer
- Character-level: 256 vocab (much smaller!)
- Word-level: 50,000+ vocab (much larger!)
Trade-off:
- Small vocab → more tokens per sentence → longer sequences
- Large vocab → fewer tokens per sentence → bigger embedding layer
Component 2: Embeddings
What Is an Embedding?
Embedding: Convert token IDs (integers) to dense vectors (floats).
Token ID: 1234
↓
Embedding Layer (lookup table)
↓
Vector: [0.23, -0.45, 0.12, ..., 0.67] # size = hidden_dim
Example:
- Vocab size: 32,000 tokens
- Hidden dim: 256
- Embedding parameters: 32,000 × 256 = 8.2M parameters
This is often the largest single layer!
Token Embeddings vs Position Embeddings
Token embedding: What is the token?
"cat"→[0.1, 0.9, ...](semantic meaning)
Position embedding: Where is the token?
- Position 0 →
[1.0, 0.0, ...] - Position 1 →
[0.9, 0.1, ...]
Combined: token_embedding + position_embedding
This tells the model both what the word is and where it appears.
Component 3: Self-Attention
The Core Idea
Self-Attention: Let each word look at every other word to understand context.
Example:
Sentence: "The cat sat on the mat"
When processing "sat":
- Look at "The" → not very relevant (weight: 0.1)
- Look at "cat" → very relevant! (weight: 0.9)
- Look at "on" → somewhat relevant (weight: 0.3)
- Look at "mat" → relevant (weight: 0.5)
graph LR
subgraph "Self-Attention for 'sat'"
A[Theweight: 0.1] -.-> E[sat]
B[catweight: 0.9] ==> E
C[onweight: 0.3] --> E
D[matweight: 0.5] --> E
E --> F[Context-aware'sat' embedding]
end
style B fill:#50C878,stroke:#333,stroke-width:2px
style E fill:#4A90E2,stroke:#333,stroke-width:2px
style F fill:#7B68EE,stroke:#333,stroke-width:2px
The model learns these weights during training.
Query, Key, Value (QKV)
Think of it like a search engine:
Query: What am I looking for?
"sat"asks: "What's the subject?"
Key: What can I offer?
"cat"says: "I'm a noun, I can be a subject!"
Value: What information do I have?
"cat"provides its semantic meaning
Math (simplified):
attention_weight = softmax(Query · Key)
output = attention_weight · Value
graph TD
A[Input Word Embedding] --> B1[Query Matrix W_Q]
A --> B2[Key Matrix W_K]
A --> B3[Value Matrix W_V]
B1 --> C1[Query Vector]
B2 --> C2[Key Vector]
B3 --> C3[Value Vector]
C1 & C2 --> D[Compute Attention ScoresQuery · Key^T]
D --> E[SoftmaxGet Attention Weights]
E & C3 --> F[Weighted SumAttention × Value]
F --> G[Context-Aware Output]
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style G fill:#7B68EE,stroke:#333,stroke-width:2px
Parameters:
- 3 weight matrices (Q, K, V): 3 × hidden_dim × hidden_dim
- For hidden_dim=256: 3 × 256 × 256 = 196K parameters per attention head
Multi-Head Attention
Instead of one attention mechanism, use multiple in parallel:
graph TD
A[Input Embeddinghidden_dim = 256] --> B[Split into 8 Heads32 dims each]
B --> H1[Head 1Syntax patternsQ/K/V: 32×32]
B --> H2[Head 2Semantic relationsQ/K/V: 32×32]
B --> H3[Head 3Long-range depsQ/K/V: 32×32]
B --> H4[Head 4-8Other patternsQ/K/V: 32×32]
H1 --> C[Concatenate Results8 heads × 32 = 256]
H2 --> C
H3 --> C
H4 --> C
C --> D[Output Projection256 → 256]
D --> E[Context-Rich Embedding]
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style E fill:#7B68EE,stroke:#333,stroke-width:2px
style H1 fill:#50C878,stroke:#333,stroke-width:2px
style H2 fill:#50C878,stroke:#333,stroke-width:2px
style H3 fill:#50C878,stroke:#333,stroke-width:2px
style H4 fill:#50C878,stroke:#333,stroke-width:2px
Why multiple heads?
- Each head can specialize in different patterns
- Head 1 might learn syntax, Head 2 might learn semantics
- Head 3 might capture long-range dependencies
- More expressive than single attention
Parameters:
- 8 heads × 196K = 1.57M parameters per multi-head attention layer
Component 4: Feed-Forward Networks
What Does It Do?
After attention tells us which words matter, the feed-forward network processes each word individually.
Structure:
graph TD
A[Input from Attentionhidden_dim = 256] --> B[Linear Layer 1256 → 1024262K params]
B --> C[Activation FunctionSwiGLU or ReLUNon-linearity]
C --> D[Linear Layer 21024 → 256262K params]
D --> E[Outputhidden_dim = 256]
F[Parameter Breakdown] --> G[Layer 1: 256 × 1024 = 262K]
F --> H[Layer 2: 1024 × 256 = 262K]
F --> I[Total: 524K params per FFN]
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style E fill:#7B68EE,stroke:#333,stroke-width:2px
style C fill:#E85D75,stroke:#333,stroke-width:2px
style G fill:#6C757D,stroke:#333,stroke-width:2px
style H fill:#6C757D,stroke:#333,stroke-width:2px
style I fill:#50C878,stroke:#333,stroke-width:2px
Typical sizing:
mlp_dim = 4 × hidden_dim- For hidden_dim=256: mlp_dim = 1024
Parameters:
- Layer 1: 256 × 1024 = 262K
- Layer 2: 1024 × 256 = 262K
- Total: 524K parameters per FFN
Why It Matters
Feed-forward networks are where most parameters live in large models:
- TinyLlama (1.1B params): ~70% in FFN layers
- Llama-3.1 (8B params): ~75% in FFN layers
Trade-off:
- Larger mlp_dim → more expressive → more parameters
- Smaller mlp_dim → faster → less capacity
Component 5: Normalization
Why Normalize?
Problem: As you stack layers, activations can explode or vanish.
Solution: Normalize after each sub-layer.
Two common approaches:
LayerNorm (older models like GPT-2):
normalized = (x - mean) / stdRMSNorm (modern models like TinyLlama):
normalized = x / rms(x)
RMSNorm is faster and works just as well.
Parameters:
- Very few! Just a scale parameter per dimension
- For hidden_dim=256: 256 parameters
Component 6: Output Layer
From Hidden States to Predictions
Final step: Convert hidden vectors back to token probabilities.
Hidden state: [0.23, -0.45, ..., 0.67] # size = hidden_dim
↓
Linear layer (hidden_dim → vocab_size)
↓
Softmax
↓
Probabilities: [0.01, 0.02, ..., 0.85] # size = vocab_size
Parameters:
- hidden_dim × vocab_size
- For 256 × 32,000 = 8.2M parameters
Often ties weights with embedding layer to save parameters:
- Embedding: vocab → hidden
- Output: hidden → vocab
- Use same weights, transposed!
Putting It All Together: TinyLlama
Architecture Summary
TinyLlama-1.1B:
vocab_size: 32,000
hidden_dim: 2048
num_layers: 22
num_heads: 32
mlp_dim: 5632 # ~2.75 × hidden_dim
max_seq_len: 2048
Parameter Breakdown
Per transformer block:
- Multi-head attention: ~16.8M parameters
- Feed-forward network: ~23.1M parameters
- Normalization: ~4K parameters
- Total per block: ~40M parameters
Full model:
- Embedding: 65.5M
- 22 transformer blocks: 22 × 40M = 880M
- Output layer: 65.5M (weight-tied with embedding)
- Total: ~1.1B parameters
graph TD
A[TinyLlama-1.1BParameter Distribution] --> B[Embedding Layer65.5M6%]
A --> C[22 Transformer Blocks880M total80%]
A --> D[Output Layer65.5M shared6%]
C --> E[Per Block: 40M params]
E --> F[Multi-Head Attention16.8M42% of block]
E --> G[Feed-Forward Network23.1M58% of block]
E --> H[Normalization~4Knegligible]
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style B fill:#7B68EE,stroke:#333,stroke-width:2px
style C fill:#50C878,stroke:#333,stroke-width:2px
style D fill:#7B68EE,stroke:#333,stroke-width:2px
style F fill:#E85D75,stroke:#333,stroke-width:2px
style G fill:#DDA0DD,stroke:#333,stroke-width:2px
Key insight: ~70% of all parameters are in the feed-forward networks!
Why This Matters
For fine-tuning (CT-4):
- You don't change the architecture
- All 1.1B parameters are there
- You just adjust their values slightly
For training from scratch (CT-8):
- You choose every number above
- Smaller numbers → faster, but less capable
- This is why we'll build a 10-20M param model!
Designing Your Own Architecture
The Scaling Laws
Rule of thumb for compute:
Training cost ∝ (num_params) × (num_tokens) × (context_length)
Trade-offs:
| Parameter | Effect if Increased | Cost if Increased |
|---|---|---|
hidden_dim |
More expressive embeddings | All layers bigger |
num_layers |
Deeper understanding | Linear scaling |
num_heads |
Richer attention patterns | Minimal (heads are split) |
mlp_dim |
More capacity per layer | Significant (most params) |
vocab_size |
Better tokenization | Bigger embedding/output |
Example: Nano-Trickster (CT-8)
Goal: Build a 10-20M parameter model for N150.
Design:
nano-trickster:
vocab_size: 256 # Character-level (simple!)
hidden_dim: 256 # Small but workable
num_layers: 6 # Shallow (6× faster than TinyLlama)
num_heads: 8 # Decent parallelism
mlp_dim: 768 # 3× hidden_dim
max_seq_len: 512 # Short context (fine for our task)
Parameter count:
- Embedding: 256 × 256 = 65K
- Per block: ~1.8M
- 6 blocks: 6 × 1.8M = 10.8M
- Output: 65K (weight-tied)
- Total: ~11M parameters
graph LR
A[Model Size Comparison] --> B[Nano-Trickster11M params]
A --> C[TinyLlama1.1B params]
B --> B1[vocab: 256char-level]
B --> B2[hidden: 256small]
B --> B3[layers: 6shallow]
B --> B4[Training: 30-60 minN150 ✓]
C --> C1[vocab: 32,000BPE]
C --> C2[hidden: 2048large]
C --> C3[layers: 22deep]
C --> C4[Training: Many hoursN300+ recommended]
style B fill:#7B68EE,stroke:#333,stroke-width:2px
style C fill:#4A90E2,stroke:#333,stroke-width:2px
style B4 fill:#50C878,stroke:#333,stroke-width:2px
style C4 fill:#E85D75,stroke:#333,stroke-width:2px
Why this works:
- Fits easily on N150 (low memory)
- Trains in 30-60 minutes (fast iteration)
- Large enough to learn patterns (not a toy)
- Small enough to understand (debuggable)
Memory and Compute Considerations
Memory Requirements
Model size (inference):
memory = num_params × bytes_per_param
For BF16 (2 bytes): 1.1B params = 2.2GB
Training memory (much higher):
memory = num_params × (
2 bytes (model weights) +
2 bytes (gradients) +
8 bytes (optimizer state, e.g., AdamW) +
4 bytes (activations per layer per token)
)
For 1.1B params + batch_size=8 + seq_len=512:
- Model + gradients + optimizer: ~13GB
- Activations: ~4GB
- Total: ~17GB
graph LR
A[Training Memoryfor 1.1B params] --> B[Model Weights2.2GBBF16 format]
A --> C[Gradients2.2GBsame size as weights]
A --> D[Optimizer State8.8GBAdamW momentum]
A --> E[Activations4GBbatch × layers]
F[Total: ~17GB] --> G[N150: TightDRAM limits]
F --> H[N300: ComfortableDistributed memory]
F --> I[Nano-model 10-20M: Easy~200MB total]
style A fill:#4A90E2,stroke:#333,stroke-width:2px
style D fill:#E85D75,stroke:#333,stroke-width:2px
style G fill:#FF6B6B,stroke:#333,stroke-width:2px
style H fill:#50C878,stroke:#333,stroke-width:2px
style I fill:#7B68EE,stroke:#333,stroke-width:2px
This is why:
- N150 is tight for 1.1B models (DRAM limits)
- N300 gives more headroom (distributed memory)
- Smaller models (10-20M) train comfortably on N150
Compute Bottlenecks
Where time is spent during training:
- Attention: ~30% (sequence_length² operations)
- Feed-forward: ~60% (matrix multiplications)
- Other: ~10% (normalization, activations, etc.)
pie title Training Time Distribution
"Feed-Forward Networks" : 60
"Attention Mechanisms" : 30
"Other (Norm, Activation)" : 10
graph TD
A[Scaling Impacts] --> B[Double Sequence Lengthseq_len: 512 → 1024]
B --> B1[Attention Cost: 4×quadratic scaling]
B --> B2[FFN Cost: Sameno sequence dependency]
A --> C[Double Hidden Dimensionhidden_dim: 256 → 512]
C --> C1[Attention Cost: 4×QKV matrices scale]
C --> C2[FFN Cost: 4×matrix sizes scale]
A --> D[Double Num Layersnum_layers: 6 → 12]
D --> D1[All Costs: 2×linear scaling]
style B1 fill:#E85D75,stroke:#333,stroke-width:2px
style C1 fill:#E85D75,stroke:#333,stroke-width:2px
style C2 fill:#E85D75,stroke:#333,stroke-width:2px
style D1 fill:#50C878,stroke:#333,stroke-width:2px
Scaling considerations:
- Double sequence length → 4× attention cost
- Double hidden_dim → 4× FFN cost
- Double num_layers → 2× everything
Key Architectural Innovations
Why Modern Models Use These
RoPE (Rotary Position Embeddings):
- Better than learned position embeddings
- Generalizes to longer sequences than trained on
- Used by: Llama, TinyLlama, many others
SwiGLU (Gated Linear Units):
- Better than ReLU activation
- More expressive for same parameter count
- Used by: Llama family
RMSNorm:
- Faster than LayerNorm
- Same performance, fewer operations
- Used by: Modern efficient models
Multi-Query Attention (MQA) / Grouped-Query Attention (GQA):
- Shares keys/values across heads
- Reduces memory for long sequences
- Used by: Llama-3.1, TinyLlama (in some variants)
Practical Implications for Training
From CT-4 (Fine-tuning) to CT-8 (From Scratch)
Fine-tuning (what you did in CT-4):
# Load pre-trained model
model = load_pretrained("TinyLlama-1.1B")
# All architecture decisions already made:
# - 22 layers
# - 2048 hidden_dim
# - 32 attention heads
# - etc.
# Just adjust weights
train(model, your_dataset)
Training from scratch (CT-8):
# YOU decide the architecture
model = TransformerModel(
vocab_size=256, # Your choice!
hidden_dim=256, # Your choice!
num_layers=6, # Your choice!
num_heads=8, # Your choice!
mlp_dim=768, # Your choice!
)
# Initialize weights randomly
model.init_weights()
# Train from zero
train(model, your_dataset)
Key difference: You control every architectural decision.
graph TD
A[Training Approaches] --> B[Fine-TuningCT-4]
A --> C[From ScratchCT-8]
B --> B1[Start: Pre-trained ModelTinyLlama 1.1BAlready knows language]
B1 --> B2[Architecture: Fixed22 layers, 2048 hiddenCan't change structure]
B2 --> B3[Training: Fast500-1000 steps1-3 hours on N150]
B3 --> B4[Result: SpecializedKeeps general knowledgeAdds new behavior]
C --> C1[Start: Random WeightsBlank slateKnows nothing]
C1 --> C2[Architecture: Your Choice6 layers, 256 hiddenYou design everything]
C2 --> C3[Training: Longer5000-10000 stepsMany hours]
C3 --> C4[Result: CustomLearns from data onlyTailored to task]
style B fill:#7B68EE,stroke:#333,stroke-width:2px
style C fill:#4A90E2,stroke:#333,stroke-width:2px
style B3 fill:#50C878,stroke:#333,stroke-width:2px
style C3 fill:#E85D75,stroke:#333,stroke-width:2px
Common Architecture Mistakes
❌ Don't: Make Everything Big
# This will OOM on N150 and train forever
bad-design:
hidden_dim: 4096 # Too big!
num_layers: 24 # Too many!
mlp_dim: 16384 # Way too big!
# Result: 2B+ parameters
✅ Do: Start Small, Scale Up
# This will work on N150
good-design:
hidden_dim: 256 # Reasonable
num_layers: 6 # Manageable
mlp_dim: 768 # 3× hidden_dim
# Result: ~11M parameters
❌ Don't: Use Incompatible Dimensions
bad-design:
hidden_dim: 256
num_heads: 7 # Not a divisor of 256!
# Error: hidden_dim must be divisible by num_heads
✅ Do: Keep Dimensions Compatible
good-design:
hidden_dim: 256
num_heads: 8 # 256 / 8 = 32 (perfect!)
Architecture Cheat Sheet
For Quick Reference
| Component | Typical Range | Nano-Trickster (CT-8) | TinyLlama |
|---|---|---|---|
vocab_size |
256-50,000 | 256 (char-level) | 32,000 (BPE) |
hidden_dim |
128-4096 | 256 | 2048 |
num_layers |
4-32 | 6 | 22 |
num_heads |
4-32 | 8 | 32 |
mlp_dim |
2-4× hidden | 768 (3×) | 5632 (2.75×) |
max_seq_len |
128-4096 | 512 | 2048 |
| Total params | - | ~11M | ~1.1B |
| Training time (N150) | - | 30-60 min | Many hours |
Beyond This Lesson: Architecting the Future of AI
You've learned how transformers work under the hood. But what can you build with this architectural knowledge? Let's explore how understanding architecture unlocks the ability to design specialized models that solve real problems.
What Developers Have Designed
Real architectures built by developers who understood the fundamentals:
🎯 "Code Completion Specialist" (Startup engineer)
- Challenge: Existing models too slow for real-time autocomplete
- Architecture insight: Reduced num_layers from 12 → 4, hidden_dim from 768 → 256
- Result: 10M parameter model with 5ms latency (vs 300ms for GPT-2)
- Trade-off: Narrower knowledge but specialized for Python syntax
- Impact: Shipped real-time code completion to 5000+ developers
🔬 "Protein Sequence Analyzer" (Biotech researcher)
- Challenge: Amino acid sequences need different tokenization than text
- Architecture choice: Character-level (20 amino acids), 8-layer transformer
- Innovation: Custom positional encoding for sequence distance
- Result: 50M parameter model outperformed 1B general models on protein tasks
- Insight: Domain knowledge + right architecture > brute force scale
🎮 "Game Dialogue Generator" (Indie studio)
- Challenge: 1B models too big for game runtime (memory constraints)
- Design: 30M params, 6 layers, 384 hidden_dim, character-level
- Optimization: Shared weights between encoder/decoder (20% size reduction)
- Result: Fits in 60MB, runs on console hardware, generates unique NPC dialogue
- Win: Architecture designed for deployment constraints from day 1
💼 "Legal Document Parser" (LegalTech company)
- Challenge: Legal text has 10x longer documents than typical LLMs handle
- Architecture innovation: Sparse attention (attend to every 4th token for long range)
- Result: 8K context window in 200M model (vs 2K in comparable dense models)
- Impact: Parse entire contracts in one pass, not chunked
- Learning: Attention pattern matters as much as model size
Specialized Architectures Beat General Models
Why architectural choices matter more than you think:
📊 Medical Q&A Model (100M params, specialized)
- Trained on 50K medical Q&A pairs
- Custom tokenizer for medical terminology
- 8 layers, 512 hidden_dim, medical-specific embeddings
- Performance: 85% accuracy on medical exams
- Comparison: GPT-3 (175B params): 60% accuracy on same exams
- Lesson: 100M specialized beats 175B general-purpose for niche domains
🔧 Hardware Verilog Generator (20M params)
- Character-level for Verilog syntax
- 4 layers, 256 hidden_dim (tiny!)
- Trained on 10K hardware designs
- Performance: Generates syntactically correct Verilog 92% of the time
- Comparison: GPT-4: 45% syntactically correct (not trained on enough Verilog)
- Lesson: Smaller, specialized models trained on quality data > huge general models
📝 Meeting Notes Summarizer (40M params)
- Encoder-decoder architecture (not decoder-only like GPT)
- 6 encoder layers, 4 decoder layers
- Custom attention for timestamp/speaker tracking
- Performance: Summarizes 1-hour meeting in 30 seconds
- Comparison: Claude 3 Opus does it too, but costs $0.50/summary vs $0.01
- Lesson: Specialized architecture enables cost-effective deployment
Architectural Patterns to Learn From
Design patterns that solve real problems:
🚀 Tiny Transformers (1-50M params) When to use:
- Real-time applications (autocomplete, chat suggestions)
- Edge deployment (mobile, embedded)
- Low-latency requirements (<10ms)
Architecture choices:
- 4-8 layers (shallow but fast)
- 128-384 hidden_dim (small embeddings)
- Character or small vocab (reduce embedding size)
- Example: MobileBERT (25M params), DistilBERT (66M)
🎯 Long-Context Transformers (50-500M params) When to use:
- Document analysis (legal, research papers)
- Code repositories (understand full files)
- Conversation history (multi-turn chat)
Architecture choices:
- Sparse attention patterns (Longformer, BigBird)
- Memory-efficient attention (FlashAttention)
- Sliding window + global attention
- Example: Longformer (148M params, 4K context)
🔬 Domain-Specific Transformers (20-200M params) When to use:
- Specialized vocabulary (medical, legal, code)
- Narrow but deep knowledge
- High accuracy > broad knowledge
Architecture choices:
- Custom tokenizer (domain-specific vocabulary)
- Embeddings pre-trained on domain data
- Architecture sized for task complexity
- Example: BioBERT (110M), CodeBERT (125M)
💡 Efficient Inference Transformers (10-100M params) When to use:
- Production deployment at scale
- Cost-sensitive applications
- High throughput requirements
Architecture choices:
- Knowledge distillation (student learns from teacher)
- Quantization-friendly designs
- Smaller FFN layers (reduce 75% of params)
- Example: TinyLlama (1.1B → efficient for hardware)
Your Architecture Design Journey
From understanding to creation:
Week 1 (Understanding - this lesson):
- Study how attention works
- Calculate parameter counts
- Understand memory/compute trade-offs
- Goal: Read architectures and understand choices
Week 2 (Experimentation - CT-8):
- Train nano-trickster (11M params)
- Modify hidden_dim, see effect on performance
- Try different num_heads
- Goal: Build intuition through hands-on experience
Month 2 (Specialization):
- Design model for your specific task
- Choose tokenization strategy
- Size architecture for your hardware
- Goal: Create custom architecture that fits your needs
Month 3+ (Innovation):
- Experiment with novel attention patterns
- Custom position encodings
- Efficient architectural tricks
- Goal: Push boundaries, contribute new ideas
Architectural Decisions That Changed Everything
Real examples of how architectural choices enable breakthroughs:
🌟 Rotary Position Embeddings (RoPE)
- Old way: Learned position embeddings (fixed max length)
- Innovation: Rotate embeddings based on position
- Impact: Models generalize beyond training length
- Adoption: LLaMA, TinyLlama, most modern models
- Lesson: Better position encoding = longer contexts for free
⚡ Grouped-Query Attention (GQA)
- Old way: Every head has its own keys/values (memory intensive)
- Innovation: Share keys/values across groups of heads
- Impact: 30% memory reduction, minimal accuracy loss
- Adoption: LLaMA-3, Mistral
- Lesson: Attention efficiency unlocks longer contexts
🎯 SwiGLU Activation
- Old way: ReLU activation (simple but limited)
- Innovation: Gated linear units with swish
- Impact: Better gradient flow, more expressive
- Adoption: LLaMA family, PaLM
- Lesson: Activation function choice matters
🔧 RMSNorm vs LayerNorm
- Old way: LayerNorm (compute mean and variance)
- Innovation: RMSNorm (just RMS, skip mean)
- Impact: 10-15% faster, same performance
- Adoption: LLaMA, TinyLlama, modern efficient models
- Lesson: Small optimizations compound across layers
Imagine: Models You Could Design
With your architectural knowledge, you could build:
🚀 Real-Time Code Autocomplete (5M params)
- 3 layers, 128 hidden_dim, character-level
- Optimized for <5ms latency
- Specialized for Python/JavaScript syntax
- Deployment: Developer tools, IDE plugins
📊 Financial Report Analyzer (30M params)
- 6 layers, 384 hidden_dim, financial terminology tokenizer
- Custom attention for table parsing
- Deployment: Analyst workflows, automated reporting
🎨 Style Transfer Text Rewriter (15M params)
- Encoder-decoder (6+4 layers)
- Style embeddings (formal, casual, technical)
- Deployment: Content marketing, email assistants
🔬 Scientific Paper Summarizer (50M params)
- 8 layers, 512 hidden_dim, academic vocabulary
- Long-context attention (8K tokens)
- Deployment: Research tools, literature review
🎮 Game Narrative Generator (20M params)
- 5 layers, 256 hidden_dim, fantasy/sci-fi vocabulary
- Character-aware generation
- Deployment: Game studios, interactive fiction
The Architecture Decision Tree
How to design your model:
graph TD
A[What's your primary constraint?] --> B{Latency}
A --> C{Memory}
A --> D{Accuracy}
B --> E[Tiny model4 layers, 128-256 hidden1-10M params]
C --> F[Efficient model6 layers, 384 hidden10-50M params]
D --> G[Larger model12+ layers, 768+ hidden100M+ params]
E --> H[Real-time applicationsautocomplete, suggestions]
F --> I[Production deploymentAPIs, mobile apps]
G --> J[High-accuracy tasksresearch, analysis]
K[What's your data?] --> L{Lots of data}
K --> M{Limited data}
L --> N[Train from scratchCustom architecture]
M --> O[Fine-tune existingAdapt architecture]
style E fill:#50C878,stroke:#333,stroke-width:2px
style F fill:#7B68EE,stroke:#333,stroke-width:2px
style G fill:#E85D75,stroke:#333,stroke-width:2px
Your design process:
- Define constraints (latency, memory, accuracy requirements)
- Choose base architecture (decoder-only, encoder-decoder, etc.)
- Size the model (layers, hidden_dim based on task complexity)
- Select components (attention type, activation, normalization)
- Iterate (train small, evaluate, adjust)
From CT-7 to CT-8: Your Design in Action
What you'll do in the next lesson:
- Take architectural knowledge from this lesson
- Design nano-trickster (11M params) from scratch
- See how each component contributes to learning
- Outcome: Practical experience with architectural decisions
The progression:
- CT-7 (Now): Understand components conceptually
- CT-8 (Next): Build and train your design
- Future: Design specialized models for your domains
You now have:
- ✅ Mental models for architecture trade-offs
- ✅ Understanding of where parameters live
- ✅ Knowledge of memory/compute costs
- ✅ Ability to evaluate architecture choices
The question isn't "Can I design a custom architecture?"
The question is "What specialized model will I design first?"
Imagine:
- A 10M model that solves your specific problem better than GPT-4
- An architecture optimized for your hardware constraints
- A model that runs in production at 1/10th the cost
- A design that becomes the foundation for your product
Architecture isn't just theory. It's power to build exactly what you need.
Key Takeaways
✅ Transformers have 6 key components: tokenization, embeddings, attention, FFN, normalization, output
✅ Most parameters live in FFN layers (60-70% of total)
✅ Architecture decisions affect training time and memory significantly
✅ Start small (10-20M params), scale up when you understand the trade-offs
✅ Modern improvements (RoPE, SwiGLU, RMSNorm) make models more efficient
✅ hidden_dim and num_layers are your main scaling knobs
Next Steps
Lesson CT-8: Training from Scratch
You now understand the components. In CT-8, you'll:
- Design a nano-trickster architecture (10-20M params)
- Initialize it from scratch
- Train on tiny-shakespeare dataset
- See a model learn language from random initialization
- Compare to random baseline (prove learning happened!)
Estimated time: 30 minutes (setup) + 30-60 minutes (training) Prerequisites: CT-7 (this lesson)
Additional Resources
Papers
- Attention Is All You Need - Original transformer paper
- BERT - Bidirectional transformers
- GPT-2 - Decoder-only architecture
- LLaMA - Modern efficient architecture
Interactive Visualizations
- The Illustrated Transformer - Visual explanations
- Transformer Explainer - Interactive visualization
Code References
- nanoGPT - Minimal GPT implementation
- TinyLlama - Training logs and architecture
- tt-train - TT-specific training framework
Ready to build your first model from scratch? Continue to Lesson CT-8: Training from Scratch →