
Module 7: Computational Complexity in Practice

Introduction: When Theory Meets Reality

In algorithms class, you learned that Big-O tells you which algorithm is faster, that lower asymptotic complexity always wins, and that the hardware underneath doesn't change the analysis.

Reality: None of this is always true.

What You'll Learn

Key Insight: Asymptotic complexity matters, but so do constants, memory access, and hardware design.


Part 1: CS Theory Revisited - When Big-O Fails

The Big-O Lie (Sort Of)

Standard teaching:

Algorithm A: O(n²)     → Slow
Algorithm B: O(n log n) → Fast

Therefore: Algorithm B is always faster!

Reality:

Algorithm A: 0.001 × n²
Algorithm B: 100 × n log n

For n = 1000:
  A: 0.001 × 1,000,000 = 1,000 operations
  B: 100 × 1000 × 10 = 1,000,000 operations

Algorithm A is 1000x faster!

Big-O hides constants. For small-to-medium inputs, constants dominate.
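
Plugging in the constants above, the crossover point where the O(n log n) algorithm finally wins is surprisingly far out. A quick sketch (the cost models are the made-up constants from above, not measurements):

import math

def cost_a(n):
    return 0.001 * n * n           # Algorithm A: 0.001 * n^2

def cost_b(n):
    return 100 * n * math.log2(n)  # Algorithm B: 100 * n log n

for n in (1_000, 100_000, 1_000_000, 10_000_000):
    winner = "A" if cost_a(n) < cost_b(n) else "B"
    print(f"n={n:>10,}: A={cost_a(n):.3g}, B={cost_b(n):.3g} -> {winner} wins")
# Algorithm A stays cheaper until n reaches the low millions.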

Example: Insertion Sort vs Merge Sort

Insertion Sort: O(n²)

def insertion_sort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]  # One comparison, one shift
            j -= 1
        arr[j + 1] = key

Merge Sort: O(n log n)

from heapq import merge  # standard-library merge of sorted sequences

def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])       # Recursive calls (overhead)
    right = merge_sort(arr[mid:])      # Array copies (overhead)
    return list(merge(left, right))    # Merge phase (overhead)

Performance (Python, random data):

n = 10:      Insertion 0.001 ms, Merge 0.010 ms  → Insertion 10x faster
n = 100:     Insertion 0.1 ms,   Merge 0.5 ms    → Insertion 5x faster
n = 1,000:   Insertion 10 ms,    Merge 5 ms      → Merge 2x faster
n = 10,000:  Insertion 1000 ms,  Merge 50 ms     → Merge 20x faster
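
A minimal harness to reproduce numbers in this ballpark, using the two functions defined above (absolute timings depend heavily on the machine and Python version):

import random
import timeit

for n in (10, 100, 1_000, 10_000):
    data = [random.random() for _ in range(n)]
    reps = 1000 // n + 1                      # fewer repetitions for larger inputs
    t_ins = timeit.timeit(lambda: insertion_sort(data.copy()), number=reps) / reps
    t_mrg = timeit.timeit(lambda: merge_sort(data.copy()), number=reps) / reps
    print(f"n={n:>6}: insertion {t_ins*1e3:9.3f} ms, merge {t_mrg*1e3:9.3f} ms")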

Conclusion: For small inputs (here, up to a few hundred elements), insertion sort wins despite being O(n²).

This is why std::sort in C++ uses a hybrid algorithm (introsort): quicksort on large ranges, with a switch to insertion sort for small subarrays.

Cache Obliviousness: The Hidden Complexity

Textbook complexity: Counts operations (adds, compares)

Reality: Memory access time dominates

Example: Matrix transpose

# Version 1: Naive traversal (cache-unfriendly)
for i in range(n):
    for j in range(n):
        B[j][i] = A[i][j]  # Reads A sequentially, writes B with stride n

Version 2: Blocked (cache-friendly)

# Process in blocks that fit in cache
for i0 in range(0, n, BLOCK):
    for j0 in range(0, n, BLOCK):
        for i in range(i0, min(i0+BLOCK, n)):
            for j in range(j0, min(j0+BLOCK, n)):
                B[j][i] = A[i][j]  # Both A and B access stay in cache

Performance (n=1024):

Version 1: 100 ms  (many cache misses)
Version 2:  10 ms  (10x faster with better cache use!)

Both are O(n²), but Version 2 is 10x faster due to memory access pattern.


Part 2: The External Memory Model

I/O Complexity: A Better Model

RAM Model (traditional CS): every memory access costs the same; you count operations.

External Memory Model (realistic): data moves in blocks between a small fast memory and a large slow memory; you count block transfers (I/Os) instead of operations.

Matrix multiply complexity:

RAM Model: O(n³) operations

External Memory Model:

Cache size: M words
Block size: B words

I/O complexity: O(n³ / (B√M))  ← Much more accurate!

For n=1024, M=1MB, B=64 bytes:

RAM Model:    1 billion operations
EM Model:     100 million cache misses → 20 billion cycles (20x slower!)

Actual runtime dominated by memory, not compute.
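
A rough back-of-envelope calculator for the two models. The constants are illustrative assumptions: the naive estimate charges one miss per cache line of the streamed operand, and the blocked estimate is the classic O(n³/(B√M)) bound.

def matmul_io_estimates(n, M_words, B_words):
    """Rough cost estimates for an n x n matmul (word granularity, constants ignored)."""
    ram_ops = n ** 3                                        # RAM model: multiply-adds
    naive_misses = n ** 3 / B_words                         # no reuse: one miss per B-word line streamed
    blocked_misses = n ** 3 / (B_words * M_words ** 0.5)    # cache-aware blocking: O(n^3 / (B * sqrt(M)))
    return ram_ops, naive_misses, blocked_misses

# n=1024, M = 1 MB of 8-byte words (131,072 words), B = 64-byte lines (8 words)
ops, naive, blocked = matmul_io_estimates(1024, 131_072, 8)
print(f"{ops:.1e} ops, ~{naive:.1e} misses naive, ~{blocked:.1e} misses blocked")
# ~1.1e9 ops, ~1.3e8 naive misses (the "100 million" above), ~3.7e5 with blocking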

Example: Summing an Array

Algorithm: Sum 1 million integers

Version 1: Strided access (cache-unfriendly)

// Stride-1000 access pattern: consecutive reads jump 1000 elements apart
uint64_t sum = 0;
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 1000; j++) {
        sum += data[i + j * 1000];  // Stride = 1000
    }
}

Version 2: Sequential access (cache-friendly)

// Access elements sequentially
uint64_t sum = 0;
for (int i = 0; i < 1000000; i++) {
    sum += data[i];  // Stride = 1
}

Performance (measured):

Version 1: 10 ms   (cache miss every access)
Version 2: 0.5 ms  (cache hit most accesses)

20x faster! Same O(n), same # operations, different memory pattern.

Part 3: Industry Context - Where Theory Failed

Case Study 1: Sorting at Scale (Timsort)

Problem: Python's sorted() must work for all data

Naive approach: Use O(n log n) algorithm (merge sort, quicksort)

Better approach: Timsort (Python's default), a hybrid of merge sort and insertion sort that detects and merges already-sorted runs in the data

Performance on real data:

Random data:         O(n log n) as expected
Partially sorted:    O(n) in best case!
Reversed runs:       O(n) by exploiting pattern

Average case: 30-40% faster than pure merge sort
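
You can see the run-detection effect directly with CPython's built-in sorted(); a small sketch (exact numbers vary by machine, but the mostly-sorted case is typically several times faster):

import random
import timeit

random_data = [random.random() for _ in range(1_000_000)]
mostly_sorted = sorted(random_data)
mostly_sorted[::10_000] = [random.random() for _ in range(100)]  # perturb 100 elements

for name, data in (("random", random_data), ("mostly sorted", mostly_sorted)):
    t = timeit.timeit(lambda: sorted(data), number=5) / 5
    print(f"{name:>14}: {t:.3f} s per sort")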

Lesson: Real data has structure. Algorithms should exploit it.

Case Study 2: Hash Tables vs Binary Search Trees

Textbook:

Hash table: O(1) average case → Always use hash tables!
BST:        O(log n)          → Slower!

Reality:

Hash table (unordered_map in C++):
  - Constant time... until collisions pile up
  - Hash function cost on every lookup
  - Rehashing cost (occasional O(n) resize)
  - Cache locality depends on the implementation (std::unordered_map chases bucket pointers)

BST (std::map in C++):
  - Poor cache locality (pointer chasing)
  - Guaranteed O(log n)
  - No resizing surprises

Benchmark (1M elements, random lookups):

Small keys (integers):
  Hash table: 0.1 ms   ← Winner (10x faster)
  BST:        1.0 ms

Large keys (strings, 100 chars):
  Hash table: 5.0 ms   ← Hash function expensive
  BST:        2.0 ms   ← Winner (comparison cheap)

Lesson: Constant factors (hash function cost, cache misses) matter.

Case Study 3: GPU Algorithms vs CPU Algorithms

Problem: Find maximum element in array

CPU Algorithm (sequential):

int max_val = arr[0];
for (int i = 1; i < n; i++) {
    if (arr[i] > max_val) max_val = arr[i];
}
// Time: O(n)

GPU Algorithm (parallel):

// Parallel reduction (tree-based)
// Step 1: Compare pairs (parallel)
for (int i = 0; i < n/2; i++) {
    temp[i] = max(arr[2*i], arr[2*i+1]);
}
// Step 2: Recurse on temp (log n steps)
// Time: O(log n) with n/2 processors

// But: Kernel launch overhead, memory transfers
// Real time: O(log n) compute + 1ms overhead
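
The tree shape of the reduction is easy to sketch sequentially; each loop iteration below corresponds to one parallel step on the GPU (illustrative only, not the actual kernel):

import numpy as np

def tree_max(arr):
    """Pairwise (tree) reduction: log2(n) passes over halved arrays."""
    a = np.asarray(arr)
    while a.size > 1:
        if a.size % 2:                       # pad odd lengths so the pairs line up
            a = np.append(a, a[-1])
        a = np.maximum(a[0::2], a[1::2])     # compare pairs "in parallel"
    return a[0]

print(tree_max([3, 7, 1, 9, 4, 2]))          # 9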

Performance:

n = 1,000:
  CPU: 0.001 ms
  GPU: 1.0 ms (overhead dominates)
  CPU wins!

n = 1,000,000:
  CPU: 1 ms
  GPU: 0.1 ms
  GPU wins!

Lesson: Parallelism overhead only pays off for large inputs.


Part 4: On Tenstorrent - Algorithm-Hardware Co-Design

Flash Attention: The Ultimate Example

Background: Transformer attention is O(n²) in sequence length

Standard Attention:

# O(n²) memory and compute
scores = Q @ K.T        # n×n matrix
attention = softmax(scores)  # Materialize full n×n matrix
output = attention @ V

Problem for n=16K:

scores matrix: 16K × 16K × 4 bytes = 1 GB
  → Blows up with sequence length: at fp32 the scores matrix alone would exceed an 80 GB A100 around n ≈ 140K
  → I/O bottleneck (transferring 1 GB to/from DRAM)

Flash Attention v2: Reordering for Memory

Key insight: Don't materialize full scores matrix. Compute in blocks.

# Pseudocode (actual implementation is more complex)
output = zeros(n, d)

# Outer loop: Process Q in blocks
for q_block in Q_blocks:  # Each block fits in L1/shared memory
    # Inner loop: Process K, V in blocks
    for k_block, v_block in zip(K_blocks, V_blocks):
        # Compute attention for this block pair
        scores_block = q_block @ k_block.T  # Small matrix (fits in cache!)
        attn_block = softmax(scores_block)
        output_block = attn_block @ v_block

        # Accumulate (online softmax trick)
        output += output_block  # Careful normalization required
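
The "careful normalization" is the online-softmax trick: keep a running row maximum and denominator, and rescale earlier partial results whenever the maximum grows. A minimal NumPy reference sketch (illustrative only; real kernels are tiled, fused, and numerically hardened):

import numpy as np

def flash_attention_reference(Q, K, V, block=128):
    """Blocked attention with online softmax; matches softmax(Q @ K.T) @ V."""
    n, d = Q.shape
    out = np.empty((n, d))
    for qs in range(0, n, block):
        q = Q[qs:qs + block]                       # (bq, d) query block
        m = np.full(q.shape[0], -np.inf)           # running row max of scores
        l = np.zeros(q.shape[0])                   # running softmax denominator
        acc = np.zeros((q.shape[0], d))            # unnormalized output accumulator
        for ks in range(0, n, block):
            s = q @ K[ks:ks + block].T             # (bq, bk) score block (fits in cache)
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])         # block softmax numerators
            rescale = np.exp(m - m_new)            # shrink earlier partials if the max grew
            l = l * rescale + p.sum(axis=1)
            acc = acc * rescale[:, None] + p @ V[ks:ks + block]
            m = m_new
        out[qs:qs + block] = acc / l[:, None]      # normalize once per query block
    return out

For small n this can be checked directly against the three-line standard attention above.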

Performance:

Standard Attention (PyTorch):
  - Seq len 16K: 1 GB memory, 500 ms
  - Seq len 32K: 4 GB memory, 2000 ms
  - Seq len 64K: Out of memory

Flash Attention v2:
  - Seq len 16K: 100 MB memory, 50 ms  (10x faster!)
  - Seq len 32K: 200 MB memory, 100 ms (20x faster!)
  - Seq len 64K: 400 MB memory, 200 ms (Works!)

Why faster?

  1. Memory I/O reduced: From O(n²) DRAM accesses to O(n²/B) where B = block size
  2. Cache locality: All computation happens on L1/shared memory
  3. No intermediate materialization: Never write full scores matrix to DRAM

Theoretical complexity: still O(n²) compute operations
Practical I/O complexity: O(n²/B) block transfers (B = block size, e.g., 128)
Result: "O(n) in practice" for memory-bound workloads

Tenstorrent Implementation: Near-Memory Compute Advantage

Flash Attention on Tenstorrent:

// Each Tensix has 1.5 MB L1 SRAM
// Block size: 128×128 (fits comfortably in L1)

for (uint32_t q_block_id = my_core_id; q_block_id < num_q_blocks; q_block_id += num_cores) {
    // Load Q block to L1 (one-time cost)
    noc_async_read(Q_dram + q_block_id * BLOCK_SIZE, Q_l1, BLOCK_SIZE);
    noc_async_read_barrier();

    // Loop over K, V blocks
    for (uint32_t kv_block_id = 0; kv_block_id < num_kv_blocks; kv_block_id++) {
        // Load K, V blocks to L1
        noc_async_read(K_dram + kv_block_id * BLOCK_SIZE, K_l1, BLOCK_SIZE);
        noc_async_read(V_dram + kv_block_id * BLOCK_SIZE, V_l1, BLOCK_SIZE);
        noc_async_read_barrier();

        // Compute entirely on L1 (fast!)
        matmul_l1(Q_l1, K_l1, scores_l1);  // 128×128 @ 128×128
        softmax_l1(scores_l1, attn_l1);    // Softmax on 128×128
        matmul_l1(attn_l1, V_l1, out_l1);  // 128×128 @ 128×128

        // Accumulate to output
        accumulate_l1(out_l1, output_l1);
    }

    // Write output block to DRAM
    noc_async_write(output_l1, output_dram + q_block_id * BLOCK_SIZE, BLOCK_SIZE);
    noc_async_write_barrier();
}

Advantage of Tenstorrent architecture: each Tensix core's large L1 SRAM allows bigger attention blocks than typical GPU shared memory, and explicit data movement means fewer synchronization points.

Expected speedup: 2-5x over GPU Flash Attention (larger blocks, less synchronization overhead)


Part 5: Roofline Analysis - Compute vs Memory Bound

The Roofline Model

Question: Is my algorithm limited by compute or memory bandwidth?

Roofline visualization:

Performance
   ▲
   │                    ┌─── Compute Bound (flat roof)
   │                 ┌──┘
   │              ┌──┘
   │           ┌──┘
   │        ┌──┘  Memory Bound (diagonal roof)
   │     ┌──┘
   │  ┌──┘
   └──┴────┴────┴────┴────> Operational Intensity
           (FLOPs per byte)

Definitions:

  Operational intensity (OI) = FLOPs performed per byte moved to or from memory
  Machine balance = Peak FLOPS / Peak bandwidth; kernels with OI below this point are memory bound, above it compute bound

Example: Matrix Multiply

Matrix multiply: C = A × B (all n×n matrices)

Operations: 2n³ FLOPs (n³ multiplies + n³ adds)
Data: 12n² bytes (read A, read B, write C at 4 bytes per element)

Operational Intensity:

OI = 2n³ / 12n² = n/6

For n=128:  OI = 21 FLOPs/byte
For n=1024: OI = 171 FLOPs/byte

Tenstorrent (representative numbers):

  Peak compute: ~100 TFLOPS
  DRAM bandwidth: ~100 GB/s
  Balance point: 100 TFLOPS / 100 GB/s = 1000 FLOPs/byte

Analysis:

n=128:  OI=21 < 1000     → Memory bound (could be 50x faster with infinite bandwidth!)
n=1024: OI=171 < 1000    → Still memory bound (could be 6x faster)
n=8192: OI=1365 > 1000   → Compute bound (saturating FPUs)

Optimization strategy: below the balance point, reduce data movement (larger blocks, operator fusion, reuse in L1); above it, focus on keeping the math units busy.

Example: Vector Addition

Vector addition: C[i] = A[i] + B[i]

Operations: n FLOPs (one add per element)
Data: 12n bytes (read A, read B, write C)

Operational Intensity:

OI = n / 12n = 0.083 FLOPs/byte  (very low!)

Comparison to threshold (1000 FLOPs/byte):

0.083 << 1000  → Extremely memory bound!

Theoretical peak: 100 TFLOPS
Memory limited:  100 GB/s × 0.083 FLOPs/byte = 8.3 GFLOPS

Only 0.008% of peak compute!

Conclusion: Vector addition achieves roughly 10,000x lower FLOP throughput than a large matrix multiply because it is memory-bound.

This is why GPUs underperform on simple operations - they have massive compute but limited bandwidth.
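
The roofline arithmetic above fits in a few lines. The peak numbers below are the representative figures used in this section, not datasheet values:

def roofline(flops, bytes_moved, peak_flops=100e12, peak_bw=100e9):
    """Attainable performance under the roofline model."""
    oi = flops / bytes_moved                        # operational intensity (FLOPs/byte)
    attainable = min(peak_flops, peak_bw * oi)      # min(compute roof, bandwidth * OI)
    bound = "compute" if peak_bw * oi >= peak_flops else "memory"
    return oi, attainable, bound

n = 1024
print(roofline(2 * n**3, 12 * n**2))    # matmul: OI ~171 FLOPs/byte, ~17 TFLOPS, memory bound
print(roofline(10**6, 12 * 10**6))      # vector add: OI ~0.083, ~8.3 GFLOPS, memory bound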


Part 6: Real Performance Predictions

Bringing It All Together: Matrix Multiply Performance Model

From all 7 modules:

Module 1 (Architecture): Tensix cores execute RISC-V instructions
Module 2 (Memory): L1 SRAM (1.5 MB, 1 cycle) vs DRAM (1 GB, 200 cycles)
Module 3 (Parallelism): 176 cores, but Amdahl's Law limits speedup
Module 4 (Networks): NoC transfers (1 cycle/hop)
Module 5 (Synchronization): Barrier overhead (~50 cycles)
Module 6 (Abstraction): TTNN compiles to optimized kernels
Module 7 (Complexity): This module!

Performance model for C = A × B (n×n matrices on Tenstorrent):

Step 1: Determine operational intensity
  OI = 2n³ / 12n² = n/6

Step 2: Check if compute or memory bound
  Threshold = Peak FLOPS / Peak Bandwidth = 1000 FLOPs/byte
  If OI < 1000: Memory bound
  If OI > 1000: Compute bound

Step 3: Calculate theoretical time
  Memory bound: Time = (Data transferred) / Bandwidth
                     = 12n² bytes / (100 GB/s) = 0.12n² ns

  Compute bound: Time = Operations / Peak FLOPS
                      = 2n³ / (100 TFLOPS) = 2×10⁻⁵ n³ ns

Step 4: Apply parallelism
  Divide by number of cores (176)
  But: Apply Amdahl's Law (assume 90% parallelizable)
    Speedup = 1 / (0.1 + 0.9/176) = ~9x (not 176x)

Step 5: Add overhead
  DMA transfers: 200 cycles per transfer × # transfers
  Barriers: 50 cycles × # synchronization points
  Typically ~10-20% overhead

Final predicted time = Theoretical time / Speedup × (1 + overhead)

Example prediction (n=1024):

OI = 1024/6 = 171 FLOPs/byte < 1000 → Memory bound

Theoretical time: 12 × 1024² / 100 GB/s = 126 μs
Parallelism: 126 μs / 9 = 14 μs
Overhead: 14 μs × 1.2 = 17 μs

Predicted: ~17 μs
Measured:  ~20 μs (close!)
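
The five steps above, as a small function (all constants are the representative numbers from this module; real kernels will deviate):

def predict_matmul_time_s(n, peak_flops=100e12, peak_bw=100e9,
                          cores=176, parallel_frac=0.9, overhead=0.2):
    """Rough runtime prediction for an n x n matmul, following Steps 1-5."""
    flops = 2 * n**3
    bytes_moved = 12 * n**2
    oi = flops / bytes_moved                                      # Step 1: operational intensity
    balance = peak_flops / peak_bw                                # Step 2: ~1000 FLOPs/byte
    base = (bytes_moved / peak_bw) if oi < balance else (flops / peak_flops)  # Step 3
    speedup = 1 / ((1 - parallel_frac) + parallel_frac / cores)   # Step 4: Amdahl's Law
    return base / speedup * (1 + overhead)                        # Step 5: add overhead

print(f"{predict_matmul_time_s(1024) * 1e6:.1f} us")  # ~16 us (the worked example above rounds to ~17 us)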

When Predictions Fail

Model assumes: perfect overlap of compute and data movement, full bandwidth utilization, perfectly balanced work across cores, and zero contention.

In practice: NoC congestion, load imbalance, DRAM access-pattern penalties, and kernel launch/setup costs all take a cut.

Result: Measured performance often 50-70% of theoretical peak.

This is normal! Even highly optimized code rarely exceeds 70% of theoretical peak.


Part 7: Discussion Questions

Question 1: Why Does Bubble Sort Still Exist?

Q: Bubble sort is O(n²), merge sort is O(n log n). Why is bubble sort still taught?

A: Educational value + specific use cases.

Educational: it is the simplest sort to reason about, which makes it a good first example of nested loops, invariants, and O(n²) analysis.

Practical: with an early-exit flag it runs in O(n) on already-sorted or nearly-sorted data, and its tiny inner loop is perfectly adequate for very small arrays (see the sketch after the benchmark below).

Benchmark (nearly sorted array, n=1000):

Bubble sort:  0.1 ms  (detects sorted early)
Merge sort:  10 ms    (always does full sort)

Bubble sort 100x faster!
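
The early exit is a single flag; a minimal sketch:

def bubble_sort_adaptive(arr):
    """Bubble sort with an early-exit flag: O(n) on already-sorted input."""
    n = len(arr)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:          # a full pass with no swaps means the array is sorted
            break
    return arr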

Lesson: Worst-case complexity isn't the only metric. Real data has structure.

Question 2: Is O(1) Always Fastest?

Q: Hash table lookups are O(1), binary search is O(log n). Hash tables always win?

A: No! Constant factors and cache behavior matter.

Hash table issues: the hash function runs on every lookup, bucket chains mean pointer chasing, and occasional rehashes cause latency spikes.

Binary search advantages: the sorted array is contiguous and prefetcher-friendly, a lookup in 1M elements is only ~20 comparisons, and there are no hidden resizes.

Benchmark (1M sorted integers, random lookups):

Binary search (sorted array):  20 ns  per lookup
Hash table (unordered_map):    50 ns  per lookup

Binary search 2.5x faster! (Despite O(log n) vs O(1))

Why? Cache locality + simple operations beat hash function overhead.

Question 3: Can We Ever Beat O(n log n) Sorting?

Q: Comparison-based sorting is Ω(n log n). Can we do better?

A: Yes, with extra assumptions!

Counting sort (integers in range [0, k)):

def counting_sort(arr, k):
    counts = [0] * k
    for x in arr:
        counts[x] += 1  # O(n)

    result = []
    for i in range(k):
        result.extend([i] * counts[i])  # O(k)
    return result

# Time: O(n + k)

When k << n (e.g., sorting grades 0-100):

n=1,000,000 students, k=101 grades

Comparison sort: O(n log n) = ~20M operations
Counting sort:   O(n + k)   = ~1M operations

Counting sort 20x faster!

But: you must know the value range in advance, it costs O(k) extra memory, and it only works for integer-like keys (no arbitrary comparison function).

Lesson: Lower bounds only apply when assumptions hold. Change assumptions → change bounds.


Part 8: The Ultimate Lesson - Co-Design

Algorithm + Hardware = Performance

Case Study: Deep Learning Evolution

2012: AlexNet on Consumer GPUs

Architecture: 5 conv layers, 3 FC layers
Hardware: Two consumer GPUs (NVIDIA GTX 580)
Algorithm: Standard convolution (O(n⁴) naively)
Time: ~1 week to train

2015: ResNet on GPU

Architecture: 50-152 layers with residual connections
Hardware: NVIDIA K40 GPU
Algorithm: cuDNN optimized convolutions
Time: ~1 week to train (50x more complex, same time!)

2017: Transformers on TPU

Architecture: Attention-based (no convolutions)
Hardware: Google TPU (matrix multiply optimized)
Algorithm: Standard attention (O(n²))
Time: ~1 week to train (100x more parameters)

2023: GPT-4 on GPU Clusters

Architecture: Transformer + MoE
Hardware: 10,000+ NVIDIA H100s
Algorithm: Flash Attention (memory-optimized O(n²))
Time: ~months to train (1000x more parameters)

Each era: new hardware arrived, and the algorithms were redesigned to exploit it; neither alone explains the gains.

Tenstorrent's Co-Design Approach

Hardware design choices: many simple RISC-V cores, large per-core L1 SRAM, a 2D NoC mesh, explicit data movement, and no cache-coherence hardware.

Algorithms optimized for this hardware: tile-based (blocked) matrix multiply, Flash-style attention, and kernels that keep their working sets resident in L1.

Result: 10-100x better performance than "algorithm on wrong hardware."


Part 9: Key Takeaways

After completing all 7 modules, you should understand how algorithm choice, memory behavior, parallelism, and hardware design interact to determine real performance.

The Ultimate Insight

Performance is a system problem.

You can't optimize: the algorithm without knowing the hardware, the hardware without knowing the workload, or any single layer of the stack in isolation.

The best performance comes from co-design:

  1. Understand the hardware (880 cores, L1 SRAM, NoC mesh)
  2. Design algorithms that match hardware strengths
  3. Profile and iterate (measure, optimize, repeat)

Example: Flash Attention keeps the same O(n²) math but reorders it around the memory hierarchy, turning an out-of-memory workload into a fast one.

This is the art and science of high-performance computing.


Part 10: Where To Go From Here

You've completed the CS Fundamentals series! What's next?

Apply Your Knowledge

Lesson 14: Explore Metalium - Dive into tt-metal programming
Lesson 15: Metalium Cookbook - Build real projects (Conway's Life, Fractals, Audio)
Lesson 17: AnimateDiff - Bring up a new model from scratch

Contribute to Open Source

Tenstorrent Model Zoo - Port models to TT hardware
Bounty Program - Get paid for model bring-ups
GitHub - Contribute to tt-metal, tt-xla, tt-forge

Go Deeper

Research Papers:

Books:

Share Your Knowledge

You now understand how computers execute instructions, why memory dominates performance, and how algorithms and hardware have to be designed together.

Teach others! The best way to master a topic is to teach it.


Summary: The Full Journey

We explored 7 fundamental CS concepts on real hardware:

Module 1: What is a Computer?

Module 2: Memory Hierarchy

Module 3: Parallel Computing

Module 4: Networks

Module 5: Synchronization

Module 6: Abstraction Layers

Module 7: Computational Complexity

You've gone from "what is a computer?" to "how do I optimize Flash Attention on 880 cores."

Welcome to the world of high-performance computing. 🚀


Additional Resources

Performance Analysis

Algorithm Design

Tenstorrent Resources


Part 11: Full Circle - Back to RISC-V

Remember Module 1? You ran a simple RISC-V addition example:

# In Module 1: Simple addition on ONE RISC-V core
addi t0, zero, 14
addi t1, zero, 7
add  t2, t0, t1

Now in Module 7, you understand:

Flash Attention across 880 RISC-V cores

Each core:

  • Executes RISC-V instructions
  • Has 1 MB L1 SRAM (near-memory compute)
  • Communicates via NoC
  • Runs at 1 GHz

Together:

  • 880 cores × 1 MB = 880 MB on-chip
  • 4–6× speedup vs GPU (O(n²) → O(n) in practice)
  • Tile-based memory access patterns
  • Explicit synchronization (no cache coherence overhead)

Same fetch-decode-execute cycle. Same RISC-V ISA (RV32IM). Now orchestrated at massive scale.

The Seven-Module Journey:

  1. Module 1 (RISC-V): Understood ONE core's fetch-decode-execute cycle
  2. Module 2 (Memory): Learned why 1 MB L1 SRAM matters (200-cycle DRAM penalty!)
  3. Module 3 (Parallelism): Scaled from 1 → 880 cores (Amdahl's Law constraints)
  4. Module 4 (Networks): Mastered the NoC connecting those 880 cores
  5. Module 5 (Sync): Learned explicit barriers (no cache coherence hardware)
  6. Module 6 (Abstraction): Understood Python → TTNN → C++ → RISC-V stack
  7. Module 7 (Complexity): Applied everything to real algorithm optimization

You started with add t2, t0, t1 and ended with Flash Attention on 880 cores.

Why RISC-V Architecture Matters

Tenstorrent chose RISC-V for a reason:

Decision → Benefit

  Open ISA → No licensing fees, community innovation
  Simple instructions → Fast decode, predictable performance
  Minimal hardware → More die area for compute + SRAM
  Explicit control → Programmer decides memory movement
  No cache coherence → Eliminates coherence traffic overhead

This is the opposite of x86/ARM, where complex decoders, hardware-managed caches, and coherence protocols hide what the memory system is doing from the programmer.

You control the machine. The machine doesn't surprise you.

What Makes 880 RISC-V Cores Special?

Not just "880 of something." Each Tensix core is:

flowchart TD
    subgraph CORE["Tensix Core (RISC-V RV32IM)"]
        BRISC["RISC-V Processor (BRISC)\n• 1 GHz clock\n• 32-bit registers\n• Integer + multiply"]
        L1["1 MB L1 SRAM\n• 32 bytes/cycle BW\n• 3–5 cycle latency"]
        FPU["FPU + SFPU + MatMul\n(accelerators)"]
        BRISC --> L1 --> FPU
    end
    CORE --- NOC["Connected to 12×12 NoC mesh"]
    style BRISC fill:#1A3C47,stroke:#4FD1C5,color:#4FD1C5
    style L1 fill:#2D3142,stroke:#81E6D9,color:#E8F0F2
    style FPU fill:#1A3C47,stroke:#EC96B8,color:#EC96B8
    style NOC fill:#0F2A35,stroke:#607D8B,color:#E8F0F2

880 of these = supercomputer on a chip.

Your New Superpower: RISC-V Systems Programming

Most engineers write code. You now understand what that code becomes: instructions on a core, traffic through a memory hierarchy, and messages across a network of cores.

You can now: read a kernel and predict whether it is compute or memory bound, estimate its runtime with a roofline model, and know where to look when the measurement disagrees.


Congratulations! You're a RISC-V Systems Programmer

You've completed the CS Fundamentals series. You now have a working mental model that runs from a single RISC-V instruction to an 880-core Flash Attention kernel.

What you've learned applies to: GPUs, TPUs, multi-core CPUs, and any other parallel system with a memory hierarchy, not just Tenstorrent hardware.

You started with ONE RISC-V instruction. You end with 880 cores orchestrated for Flash Attention.

Keep building, keep optimizing, keep learning. The hardware is waiting for your code! 🔥


What's Next?

Continue Your RISC-V Journey

Want to go deeper?

Advanced RISC-V Topics

Beyond CS Fundamentals:

The foundation is RISC-V. The sky is the limit. 🚀