
Module 2: The Memory Hierarchy

Introduction: Why Memory Matters More Than Compute

In Module 1, we measured something striking:

Memory is 50x slower than computation.

This imbalance is the defining challenge of modern computing. It doesn't matter how fast your processor is if it's waiting for data.

What You'll Learn

Key Insight: Fast code is about memory access patterns, not clever algorithms.


Part 1: CS Theory - The Memory Hierarchy

The Memory Pyramid

Every computer system has a hierarchy of memory, trading size for speed:

graph TD
    subgraph "Memory Hierarchy"
    R["Registers32 × 32-bit0 cycles~1 KB"]
    L1["L1 SRAM1.5 MB per core1-2 cyclesFAST"]
    L2["L2 Cache256 KB - 32 MB10-20 cyclesMedium"]
    DRAM["DRAM1-64 GB200+ cyclesSLOW"]
    DISK["Disk/SSDTB scale100,000+ cyclesVERY SLOW"]
    end

    R --> L1
    L1 --> L2
    L2 --> DRAM
    DRAM --> DISK

    style R fill:#499c8d,stroke:#fff,color:#fff
    style L1 fill:#3293b2,stroke:#fff,color:#fff
    style L2 fill:#5347a4,stroke:#fff,color:#fff
    style DRAM fill:#666,stroke:#fff,color:#fff
    style DISK fill:#333,stroke:#fff,color:#fff

The Rule: Each level is ~10x larger and ~10x slower than the one above it.

Why This Hierarchy Exists

Q: Why not make everything as fast as registers?

A: Cost and physics.

| Memory Type | Cost per GB | Latency | Size Limit |
|---|---|---|---|
| SRAM (registers/L1) | $10,000 | 1 cycle | ~10 MB |
| DRAM | $10 | 200 cycles | 100+ GB |
| SSD | $0.10 | 100,000 cycles | TB scale |

SRAM is 1000x more expensive than DRAM. Building a 64 GB system from SRAM would cost $640,000!

Physics: Larger memories have longer wires → higher latency. Can't beat the speed of light.

Locality of Reference

Programs exhibit two types of locality:

Temporal Locality

"If you access memory location X, you'll likely access it again soon."

Example:

// Loop counter 'i' is accessed repeatedly
for (int i = 0; i < 1000; i++) {
    sum += data[i];
}

Optimization: Keep i in a register (0 cycles) instead of reloading from memory each time.
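The aliasing case makes this concrete: if the compiler cannot prove two pointers never overlap, it must reload a value from memory on every iteration. A minimal C++ sketch (function and parameter names are illustrative):

// Reads *scale from memory on every iteration: the compiler can't keep it
// in a register unless it proves 'out' never aliases 'scale'.
void scale_slow(float* out, const float* in, const float* scale, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * (*scale);   // potential reload each time
}

// Hoisting the value into a local makes the temporal locality explicit:
// one load, then the value lives in a register for the whole loop.
void scale_fast(float* out, const float* in, const float* scale, int n) {
    float s = *scale;
    for (int i = 0; i < n; i++)
        out[i] = in[i] * s;
}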

Spatial Locality

"If you access memory location X, you'll likely access X+1, X+2... soon."

Example:

// Array elements accessed sequentially
for (int i = 0; i < 1000; i++) {
    sum += data[i];  // data[0], data[1], data[2]...
}

Optimization: Fetch entire cache line (64 bytes) when you access data[0], not just 4 bytes.
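You can observe spatial locality directly by summing the same array at different strides: the arithmetic is identical, only the access pattern changes. A minimal sketch (absolute timings will vary by machine):

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1 << 24;               // 16M ints = 64 MB, far bigger than cache
    std::vector<int> data(N, 1);

    auto time_sum = [&](int stride) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = 0;
        for (int s = 0; s < stride; s++)         // every element is visited once...
            for (int i = s; i < N; i += stride)  // ...but in stride-sized jumps
                sum += data[i];
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("stride %2d: sum=%lld, %lld ms\n", stride, sum, ms);
    };

    time_sum(1);   // sequential: every byte of each 64-byte line is used
    time_sum(16);  // 16 ints = 64 bytes: a fresh cache line on every access
    return 0;
}

Both calls do exactly the same additions; the strided version is slower only because each access pulls in a new cache line whose neighbors get evicted before they are reused.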

The 90-10 Rule

Empirical observation: 90% of execution time is spent in 10% of the code.

This is WHY caching works: keep the hot 10% in fast memory and the vast majority of accesses become cache hits. If programs accessed memory randomly, caching wouldn't help.


Part 2: Industry Context - Memory Bottlenecks Everywhere

Why NumPy is Fast (and Pure Python is Slow)

Consider summing 1 million numbers:

Pure Python:

# Slow: ~100ms
numbers = list(range(1_000_000))
total = 0
for i in range(1_000_000):
    total += numbers[i]  # Python list: pointer chase every iteration

NumPy:

# Fast: ~1ms (100x faster!)
import numpy as np

numbers = np.arange(1_000_000)
total = np.sum(numbers)  # Contiguous memory, vectorized, cache-friendly

Why the difference?

It's not the algorithm (both are O(n)) - it's the memory access pattern.
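The same gap is easy to reproduce in C++: summing a pointer-chased linked list (roughly what a Python list of boxed integers does) versus a contiguous array. A sketch, with the caveat that nodes allocated back-to-back understate the penalty of a real scattered heap:

#include <cstdio>
#include <vector>

// A Python list stores pointers to boxed objects: summing it chases one
// pointer per element. A NumPy array is one contiguous block of machine ints.
struct Node { long value; Node* next; };

long sum_list(const Node* head) {   // one dependent pointer load per element
    long total = 0;
    for (const Node* p = head; p != nullptr; p = p->next) total += p->value;
    return total;
}

long sum_array(const std::vector<long>& v) {  // sequential, cache-line friendly
    long total = 0;
    for (long x : v) total += x;
    return total;
}

int main() {
    const int N = 1'000'000;
    std::vector<long> arr(N, 1);
    std::vector<Node> pool(N);      // NOTE: contiguous here; a real heap
    for (int i = 0; i < N; i++) {   // scatters nodes, making lists even slower
        pool[i].value = 1;
        pool[i].next = (i + 1 < N) ? &pool[i + 1] : nullptr;
    }
    std::printf("list=%ld array=%ld\n", sum_list(pool.data()), sum_array(arr));
    return 0;
}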

Database Columnar Storage

Traditional row-based storage:

Row 1: [id=1,  name="Alice", age=30, salary=100000]
Row 2: [id=2,  name="Bob",   age=25, salary=80000]
Row 3: [id=3,  name="Carol", age=35, salary=120000]

Query: "SELECT AVG(salary) FROM employees"

Columnar storage (Parquet, BigQuery):

Column salary: [100000, 80000, 120000]
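In code, the difference is simply how many bytes each scan touches. A sketch with a hypothetical Employee layout:

#include <cstdint>
#include <cstdio>
#include <vector>

struct Employee {       // row layout: ~36 bytes per record
    int32_t id;
    char    name[24];
    int32_t age;
    int32_t salary;
};

// Row store: reaching each salary drags the whole record through the cache.
double avg_salary_rows(const std::vector<Employee>& rows) {
    long long sum = 0;
    for (const Employee& e : rows) sum += e.salary;  // loads ~36 bytes per row
    return static_cast<double>(sum) / rows.size();
}

// Column store: the salary column is contiguous; only 4 bytes per element move.
double avg_salary_column(const std::vector<int32_t>& salary) {
    long long sum = 0;
    for (int32_t s : salary) sum += s;
    return static_cast<double>(sum) / salary.size();
}

int main() {
    std::vector<Employee> rows = {{1, "Alice", 30, 100000},
                                  {2, "Bob", 25, 80000},
                                  {3, "Carol", 35, 120000}};
    std::vector<int32_t> salary = {100000, 80000, 120000};
    std::printf("rows: %.0f  column: %.0f\n",
                avg_salary_rows(rows), avg_salary_column(salary));
    return 0;
}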

GPU Kernel Fusion

Problem: Separate operations = multiple memory round-trips

# Slow: 3 separate kernels, 3 DRAM reads/writes
y = relu(x)          # DRAM → GPU → DRAM
z = batchnorm(y)     # DRAM → GPU → DRAM
output = dropout(z)  # DRAM → GPU → DRAM

Solution: Fuse into one kernel

# Fast: 1 kernel, 1 DRAM read/write
output = dropout(batchnorm(relu(x)))  # Intermediate results stay in registers

Speedup: 3-5x, purely from avoiding memory traffic.
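The same fusion idea works on any processor. A CPU sketch with simplified stand-ins for relu/batchnorm/dropout (mean, inv_std, and keep are assumed precomputed):

#include <algorithm>
#include <vector>

// Unfused: three passes, each streaming the whole array through memory.
void unfused(std::vector<float>& x, float mean, float inv_std, float keep) {
    for (float& v : x) v = std::max(v, 0.0f);     // relu pass
    for (float& v : x) v = (v - mean) * inv_std;  // batchnorm pass
    for (float& v : x) v *= keep;                 // dropout-scale pass
}

// Fused: one pass; intermediates live in registers, memory traffic drops ~3x.
void fused(std::vector<float>& x, float mean, float inv_std, float keep) {
    for (float& v : x) {
        float t = std::max(v, 0.0f);
        t = (t - mean) * inv_std;
        v = t * keep;
    }
}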

The lesson: Memory bandwidth is precious. Minimize transfers.


Part 3: On Tenstorrent Hardware - The Memory Hierarchy

Tenstorrent's Memory Architecture

flowchart TD
    subgraph PROC["BRISC Processor (one of 880)"]
        REG["**Registers** — 32 × 32-bit\nAccess: 0 cycles · Size: 128 bytes"]
        L1["**L1 SRAM** — 1.5 MB (shared by 5 cores)\nAccess: 1–2 cycles · Bandwidth: ~1 TB/s"]
        REG --> L1
    end
    L1 -->|"NoC"| DRAM["**DRAM** — 12 GB (1 GB × 12 channels)\nAccess: 200+ cycles · Bandwidth: ~100 GB/s"]
    style REG fill:#1A3C47,stroke:#4FD1C5,color:#E8F0F2
    style L1 fill:#2D3142,stroke:#81E6D9,color:#E8F0F2
    style DRAM fill:#0F2A35,stroke:#607D8B,color:#E8F0F2

Key differences from a typical CPU:

| Feature | Your Laptop CPU | Tenstorrent |
|---|---|---|
| L1 per core | 32 KB | 1.5 MB (~48x larger!) |
| L2 cache | 256 KB - 8 MB | None (explicit control) |
| Cache coherence | Automatic | None (explicit sync) |
| DRAM access | Implicit loads | Explicit DMA |

Philosophy: Give programmers a LOT of fast memory (L1) and explicit control over movement.

No Automatic Caching

Your CPU hides the memory hierarchy:

int x = array[1000];  // CPU automatically:
                      // 1. Checks L1 cache (miss)
                      // 2. Checks L2 cache (miss)
                      // 3. Fetches from DRAM
                      // 4. Updates L1/L2 caches
                      // You don't control any of this

Tenstorrent makes it explicit:

// You must explicitly DMA from DRAM to L1
noc_async_read(dram_addr, l1_addr, size);
noc_async_read_barrier();

// Now you can access the data in L1
int x = *(int*)l1_addr;  // Fast (1 cycle)

Advantage: Complete control, predictable performance.
Disadvantage: More code, more responsibility.
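If the read-then-barrier pair appears often, it can be wrapped in a small helper. A hypothetical convenience wrapper (dma_read_blocking is not a tt-metal API, just a sketch assuming the kernel environment shown above):

// Hypothetical helper: issue the DMA, then block until the data is in L1.
inline void dma_read_blocking(uint64_t dram_addr, uint32_t l1_addr, uint32_t size) {
    noc_async_read(dram_addr, l1_addr, size);  // start the transfer
    noc_async_read_barrier();                  // wait until it completes
}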

Near-Memory Compute

graph LR
    subgraph trad["Traditional (CPU)"]
        CPU["CPU"] <-->|"slow bus ⚠"| D1["DRAM"]
    end
    subgraph tt["Tenstorrent"]
        CORE["Tensix Core\n(Compute + 1.5 MB SRAM)"] <-->|"NoC"| D2["DRAM"]
    end
    style CPU fill:#5347a4,stroke:#fff,color:#fff
    style D1 fill:#607D8B,stroke:#fff,color:#fff
    style CORE fill:#1A3C47,stroke:#4FD1C5,color:#4FD1C5
    style D2 fill:#2D3142,stroke:#81E6D9,color:#E8F0F2

Traditional (CPU): Compute happens far from memory — all data must travel over the slow bus, which becomes the bottleneck.

Tenstorrent: Compute happens next to fast L1 SRAM — bring the working set from DRAM once, do many operations in L1. Memory-bandwidth advantage.

This is why Tenstorrent is good at AI workloads - lots of reuse of data in L1.


Part 4: Hands-On - Measuring Memory Performance

Let's write experiments to measure latency and bandwidth.

Experiment Setup

We'll create two programs:

  1. Latency test: Access single elements (worst case for cache)
  2. Bandwidth test: Access large sequential chunks (best case for cache)

Build the Memory Hierarchy Example

First, let's build a custom example:

cd ~/tt-metal
./build_metal.sh --build-programming-examples

Experiment 1: Latency - Random Access

Scenario: Access 1000 random locations in DRAM

// Kernel: random_access.cpp
void kernel_main() {
    uint32_t* indices = (uint32_t*)get_arg_val<uint32_t>(0);   // Random indices (already in L1)
    uint32_t* dram_data = (uint32_t*)get_arg_val<uint32_t>(1); // Base address in DRAM
    uint32_t l1_buffer = get_arg_val<uint32_t>(2);             // L1 scratch address (assumed passed by the host)

    uint64_t start = get_cycle_count();

    uint32_t sum = 0;
    for (int i = 0; i < 1000; i++) {
        uint32_t idx = indices[i];
        // Each access hits a different cache line - WORST CASE
        noc_async_read(dram_data + idx, l1_buffer, sizeof(uint32_t));
        noc_async_read_barrier();
        sum += *(uint32_t*)l1_buffer;
    }

    uint64_t cycles = get_cycle_count() - start;
    DPRINT << "Random access: " << cycles << " cycles\n";
    DPRINT << "Avg latency: " << cycles/1000 << " cycles per access\n";
}

Expected result:

Random access: ~200,000 cycles
Avg latency: ~200 cycles per access

Why? Each access:

  1. Sends NoC request to DRAM controller
  2. Waits for DRAM latency (~200 cycles)
  3. Transfers data back over NoC
  4. No caching benefit (every access is to a different location)

Experiment 2: Bandwidth - Sequential Access

Scenario: Access 1000 sequential locations in DRAM

// Kernel: sequential_access.cpp
void kernel_main() {
    uint32_t* dram_data = (uint32_t*)get_arg_val<uint32_t>(0);
    uint32_t l1_buffer = get_arg_val<uint32_t>(1);  // L1 scratch address (assumed passed by the host)

    uint64_t start = get_cycle_count();

    // Transfer 1000 integers (4 KB) in ONE DMA
    noc_async_read(dram_data, l1_buffer, 4000);
    noc_async_read_barrier();

    // Now compute on L1 data (fast!)
    uint32_t sum = 0;
    uint32_t* data = (uint32_t*)l1_buffer;
    for (int i = 0; i < 1000; i++) {
        sum += data[i];  // Each access: 1 cycle (L1)
    }

    uint64_t cycles = get_cycle_count() - start;
    DPRINT << "Sequential access: " << cycles << " cycles\n";
    DPRINT << "Avg time: " << cycles/1000 << " cycles per element\n";
}

Expected result:

Sequential access: ~400 cycles (for entire 4 KB)
Avg time: ~0.4 cycles per element

Why the massive difference? One DMA pays the ~200-cycle DRAM latency once and amortizes it over all 1000 elements.

Key formula:

Time = Latency + (Size / Bandwidth)
     = 200 cycles + (4000 bytes / 32 bytes-per-cycle)
     = 200 + 125 = ~325 cycles

The Latency vs Bandwidth Insight

graph LR
    A["Latency(Time to first byte)"] -->|200 cycles| B["Bandwidth(Bytes per cycle)"]
    B -->|32 bytes/cycle| C["Total Time"]

    style A fill:#5347a4,stroke:#fff,color:#fff
    style B fill:#3293b2,stroke:#fff,color:#fff
    style C fill:#499c8d,stroke:#fff,color:#fff

Latency: Fixed cost to start a transfer (like driving to the store).
Bandwidth: Rate of data transfer (like how much you can carry per trip).

Optimization: Make fewer, larger transfers (one trip with a truck, not 100 trips with a bicycle).
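Plugging this module's numbers into the formula shows how strongly batching pays off. A small cost-model sketch:

#include <cstdio>

// Time = Latency + Size / Bandwidth, using this module's numbers.
double transfer_cycles(double bytes, double latency = 200.0,
                       double bytes_per_cycle = 32.0) {
    return latency + bytes / bytes_per_cycle;
}

int main() {
    // Moving 4 KB as 100 transfers of 40 bytes vs one transfer of 4000 bytes:
    double many = 100 * transfer_cycles(40);  // 100 × (200 + 1.25) ≈ 20,125
    double one  = transfer_cycles(4000);      //        200 + 125   =    325
    std::printf("100 small: %.0f cycles, 1 large: %.0f cycles (%.0fx)\n",
                many, one, many / one);
    return 0;
}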


Part 5: Practical Optimization Strategies

Strategy 1: Prefetching

Problem: Waiting for data stalls the processor

Solution: Start loading next data while processing current data

// Bad: Serial (load → compute → load → compute)
for (int i = 0; i < 100; i++) {
    noc_async_read(dram + i*SIZE, l1_buffer, SIZE);
    noc_async_read_barrier();
    process(l1_buffer);  // Processor idle during load
}
// Time: 100 × (200 cycles load + 50 cycles compute) = 25,000 cycles

// Good: Pipelined (overlap load and compute)
noc_async_read(dram, l1_buffer[0], SIZE);  // Start first load
noc_async_read_barrier();                  // Wait for it: buffer 0 is ready
for (int i = 0; i < 99; i++) {
    noc_async_read(dram + (i+1)*SIZE, l1_buffer[1], SIZE);  // Start NEXT load
    process(l1_buffer[0]);     // Compute while that load is in flight
    noc_async_read_barrier();  // Now wait for the next load to land
    swap(l1_buffer[0], l1_buffer[1]);  // Swap buffers
}
process(l1_buffer[0]);  // Process last chunk
// Time: 200 (first load) + 99 × max(200, 50) + 50 ≈ 20,050 cycles

Speedup: ~1.25x from overlapping memory and compute. The win grows as compute time approaches load time.

Strategy 2: Blocking (Tiling)

Problem: Working set doesn't fit in L1

Solution: Process data in L1-sized chunks

// Bad: Stream a 10 MB array straight from DRAM (doesn't fit in L1)
for (int i = 0; i < 10*MB/4; i++) {
    result[i] = input[i] * 2;  // DRAM access every iteration - SLOW
}

// Good: Process in 1 MB chunks (fits in L1)
for (int block = 0; block < 10; block++) {
    // Load 1 MB into L1
    noc_async_read(dram + block*MB, l1_input, MB);
    noc_async_read_barrier();

    // Process entirely from L1 (fast!)
    for (int i = 0; i < MB/4; i++) {
        l1_output[i] = l1_input[i] * 2;  // 1 cycle per iteration
    }

    // Write back to DRAM
    noc_async_write(l1_output, dram_out + block*MB, MB);
    noc_async_write_barrier();
}

This is exactly how fast matrix multiplication works (Part 7 walks through it).

Strategy 3: Data Layout

Problem: Access pattern determines performance

Bad layout (Array of Structs):

struct Particle {
    float x, y, z;      // Position (12 bytes)
    float vx, vy, vz;   // Velocity (12 bytes)
    float mass;         // Mass (4 bytes)
    // Total: 28 bytes per particle
};
Particle particles[1000];

// Update positions (only need x, y, z, vx, vy, vz)
for (int i = 0; i < 1000; i++) {
    particles[i].x += particles[i].vx;  // Load entire 28-byte struct
    particles[i].y += particles[i].vy;  // Even though we only use 24 bytes
    particles[i].z += particles[i].vz;  // 14% wasted bandwidth!
}

Good layout (Struct of Arrays):

struct Particles {
    float x[1000], y[1000], z[1000];
    float vx[1000], vy[1000], vz[1000];
    float mass[1000];
};
Particles particles;

// Update positions (only load what we need)
for (int i = 0; i < 1000; i++) {
    particles.x[i] += particles.vx[i];  // Load only x and vx
    particles.y[i] += particles.vy[i];  // Sequential access, cache-friendly
    particles.z[i] += particles.vz[i];  // No wasted bandwidth
}

Benefit: Better cache utilization, sequential access, no wasted bandwidth.

This is why columnar databases exist!


Part 6: Discussion Questions

Question 1: Why Not Just Use More SRAM?

Q: If SRAM is 50x faster than DRAM, why not build computers with 64 GB of SRAM instead of DRAM?

A: Cost, power, and physics.

Design tradeoff: Use SRAM where it matters most (close to compute), DRAM for bulk storage.

Question 2: What if We Could Eliminate Memory Latency?

Thought experiment: Imagine we invented instant memory (0-cycle latency, infinite bandwidth).

Q: Would computers be infinitely fast?

A: No! Many programs are compute-bound, not memory-bound: video encoding, cryptography, and physics simulation keep the ALUs busy, so faster memory alone wouldn't speed them up much.

But for data processing (databases, AI inference, scientific computing), memory is the bottleneck.
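A standard way to tell which regime a kernel is in is the roofline test: compare its arithmetic intensity (FLOPs per byte moved) with the machine's FLOPs-to-bandwidth ratio. A sketch with illustrative hardware numbers (the 1 TFLOP/s figure is an assumption, not a Tenstorrent spec; the 100 GB/s matches the DRAM figure above):

#include <cstdio>

// Roofline-style check: a kernel is memory-bound when its FLOPs-per-byte
// falls below the machine's peak_flops / memory_bandwidth ratio.
int main() {
    double peak_flops = 1e12;    // 1 TFLOP/s (illustrative assumption)
    double bandwidth  = 100e9;   // 100 GB/s DRAM bandwidth
    double machine_balance = peak_flops / bandwidth;  // 10 FLOPs per byte

    // Summing an array: 1 add per 4-byte element = 0.25 FLOPs/byte.
    double sum_intensity = 1.0 / 4.0;
    std::printf("machine balance: %.1f FLOPs/byte\n", machine_balance);
    std::printf("array sum: %.2f FLOPs/byte -> %s\n", sum_intensity,
                sum_intensity < machine_balance ? "memory-bound" : "compute-bound");
    return 0;
}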

This is why Tenstorrent's near-memory compute matters.

Question 3: How Does Cache Coherence Work?

Scenario: Two CPU cores accessing the same memory location

Core 0: Reads X (value: 42)
Core 1: Writes X ← 100
Core 0: Reads X again

Q: Should Core 0 see 42 or 100?

A: 100 (on x86/ARM). Hardware cache coherence protocols (MESI, MOESI) automatically synchronize:

  1. Core 1 writes X ← 100
  2. Hardware invalidates Core 0's cached copy
  3. Core 0's next read fetches updated value from Core 1's cache
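On a coherent CPU you can watch the hardware do this for you: one thread publishes a new value and another observes it with no explicit cache management. A minimal sketch using std::atomic:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{42};

int main() {
    std::thread writer([] { x.store(100, std::memory_order_release); });
    std::thread reader([] {
        // Spin until the coherence protocol makes the new value visible.
        // No flushes, no explicit DMA: MESI invalidates the stale copy for us.
        while (x.load(std::memory_order_acquire) != 100) { /* spin */ }
        std::printf("reader saw 100\n");
    });
    writer.join();
    reader.join();
    return 0;
}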

On Tenstorrent: NO cache coherence. You must explicitly synchronize (Module 5 covers this).

Tradeoff: automatic coherence is convenient but costs silicon area and power, and it scales poorly past a few dozen cores; explicit synchronization is more work for the programmer but lets 880 cores run without coherence traffic.


Part 7: Real-World Example - Matrix Multiplication

Matrix multiplication is the perfect example of memory optimization.

Naive Implementation (Slow)

// C = A × B (all 1000×1000 matrices)
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 1000; j++) {
        float sum = 0;
        for (int k = 0; k < 1000; k++) {
            sum += A[i][k] * B[k][j];  // Memory access every iteration
        }
        C[i][j] = sum;
    }
}

Problem:

  1. B is traversed column-wise, so B[k][j] lands on a new cache line every inner iteration
  2. At 4 MB per 1000×1000 float matrix, nothing stays resident in a small L1

Performance: ~10 seconds on BRISC

Blocked Implementation (Fast)

// Process in 32×32 blocks that fit in L1
#define BLOCK 32

for (int i0 = 0; i0 < 1000; i0 += BLOCK) {
    for (int j0 = 0; j0 < 1000; j0 += BLOCK) {
        for (int k0 = 0; k0 < 1000; k0 += BLOCK) {
            // Load blocks of A and B into L1
            // (load_block, A_l1, B_l1: illustrative helpers/buffers)
            load_block(A, i0, k0);
            load_block(B, k0, j0);

            // Compute on L1 data (fast!)
            for (int i = i0; i < i0+BLOCK; i++) {
                for (int j = j0; j < j0+BLOCK; j++) {
                    for (int k = k0; k < k0+BLOCK; k++) {
                        C[i][j] += A_l1[i-i0][k-k0] * B_l1[k-k0][j-j0];
                    }
                }
            }
        }
    }
}

Benefits:

  1. Each 32×32 block (4 KB) fits comfortably in L1
  2. Every element loaded from DRAM is reused 32 times instead of once

Performance: ~0.5 seconds on BRISC (20x faster!)

This is what BLAS libraries (Intel MKL, OpenBLAS) do internally.


Part 8: Connections to Other Systems

CPUs (Intel, AMD, ARM)

Modern CPUs have 3-4 levels of cache:

| Level | Size | Latency |
|---|---|---|
| L1 cache | 32–64 KB per core | ~4 cycles |
| L2 cache | 256 KB – 1 MB per core | ~12 cycles |
| L3 cache | 8–32 MB shared | ~40 cycles |
| DRAM | 16–64 GB | ~200 cycles |

Similar hierarchy, automatic management (vs Tenstorrent's explicit control).

GPUs (NVIDIA, AMD)

GPU memory hierarchy:

| Level | Size | Latency |
|---|---|---|
| Registers | 64 KB per SM | 0 cycles |
| Shared memory | 48–100 KB per SM | 1–2 cycles (like L1) |
| L1 cache | 128 KB per SM | automatic (hardware-managed) |
| L2 cache | 6–50 MB | automatic (hardware-managed) |
| Global memory (GDDR) | 16–80 GB | 200+ cycles |

CUDA programming is all about managing shared memory (equivalent to our L1 SRAM).

High-Performance Databases

PostgreSQL, MySQL (row-based): optimized for transactions that read and write whole rows at a time.

BigQuery, Snowflake, ClickHouse (columnar): optimized for analytics that scan a few columns across millions of rows - exactly the sequential access pattern cache hierarchies reward.


Part 9: Key Takeaways

After completing this module, you should understand the memory hierarchy, why locality makes caching work, the difference between latency and bandwidth, and the core optimization strategies: prefetching, blocking, and data layout.

The Core Insight

Modern computing is memory-bound, not compute-bound.

Your bottleneck is almost always data movement, not arithmetic.

Optimizing for cache = 10-100x speedup, often for free.


Part 10: Preview of Module 3 - Parallelism

Next, we scale from 1 core to 880 cores.

Teaser question: We just saw that sequential access to 1000 integers takes ~400 cycles on one core. If we split the work across 880 cores, how fast will it be? (Spoiler: nowhere near 880x faster.)

Why the difference? Communication overhead, synchronization, and Amdahl's Law.
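Amdahl's Law alone caps the gain: if a fraction p of the work parallelizes perfectly, the speedup on n cores is 1 / ((1 - p) + p/n). A quick calculation:

#include <cstdio>

// Amdahl's Law: speedup on n cores when fraction p of the work parallelizes.
double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

int main() {
    // Even 99%-parallel code gets nowhere near 880x on 880 cores.
    std::printf("p=0.99,  n=880: %.1fx\n", amdahl(0.99, 880));   // ~90x
    std::printf("p=0.999, n=880: %.1fx\n", amdahl(0.999, 880));  // ~468x
    return 0;
}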

Module 3 explores when parallelism helps (and when it doesn't).


Summary

We explored the memory hierarchy from registers to DRAM, how locality makes caching work, the latency vs bandwidth distinction, and three optimization strategies: prefetching, blocking (tiling), and data layout.

Next: We take what we've learned and scale to 880 cores in parallel.

→ Continue to Module 3: Parallel Computing