
Module 6: Abstraction Layers

Introduction: From import torch to Silicon

You write: result = torch.matmul(A, B)

What actually happens?

5+ layers of abstraction between your code and hardware.

What You'll Learn

  - How a single line like result = a + b travels from Python down to machine code
  - The abstraction stacks behind web requests, AI frameworks, and databases
  - Three paths to Tenstorrent silicon: direct RISC-V kernels, TTNN, and TT-XLA
  - When to stay high-level and when to drop down a layer for performance

Key Insight: Abstractions hide complexity, but understanding what's hidden makes you faster.


Part 1: CS Theory - The Abstraction Stack

The Full Stack

graph TD
    A["Python/JavaScript/Java(High-level language)"] --> B["C/C++/Rust(System programming language)"]
    B --> C["Assembly Language(RISC-V, x86, ARM)"]
    C --> D["Machine Code(Binary: 0s and 1s)"]
    D --> E["Silicon(Transistors executing)"]

    A -.->|"100-1000x slowerbut easy to write"| E
    E -.->|"Direct executionbut hard to write"| A

    style A fill:#ff6b6b,stroke:#fff,color:#fff
    style C fill:#3293b2,stroke:#fff,color:#fff
    style E fill:#499c8d,stroke:#fff,color:#fff

Each layer trades performance for convenience: the higher the level, the easier the code is to write and the more machine work hides behind each line.

Example: Adding Two Numbers

Python (Highest Level)

result = a + b  # Simple!

What happens:

  1. Python interpreter looks up a (dictionary lookup)
  2. Checks if a is a number (type check)
  3. Checks if b is a number (type check)
  4. Looks up __add__ method for a's type
  5. Calls C function PyNumber_Add(a, b)
  6. Allocates new PyObject for result
  7. Returns to Python

~100-1000 machine instructions for one addition!
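
You can see this hidden work directly with Python's standard dis module, which prints the bytecode the interpreter executes; a minimal sketch:

import dis

def add(a, b):
    return a + b

# Prints the bytecode behind one '+':
# LOAD_FAST a, LOAD_FAST b, BINARY_ADD (BINARY_OP on Python 3.11+), RETURN_VALUE
dis.dis(add)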

C (Mid Level)

int result = a + b;  // More explicit

What happens:

  1. Load a from memory
  2. Load b from memory
  3. Add registers
  4. Store to result

~4 machine instructions

Assembly (Low Level)

lw   t0, 0(a0)    # Load a
lw   t1, 0(a1)    # Load b
add  t2, t0, t1   # Add
sw   t2, 0(a2)    # Store result

4 explicit instructions - this is what the CPU executes

Machine Code (Lowest Level)

0x00052283    # lw t0, 0(a0)
0x0005A303    # lw t1, 0(a1)
0x006283B3    # add t2, t0, t1
0x00762023    # sw t2, 0(a2)

Binary representation - this is what lives in instruction memory

Leaky Abstractions

Joel Spolsky's Law: "All non-trivial abstractions, to some degree, are leaky."

What it means: The abstraction hides details, but sometimes those details matter.

Example: Python list append

my_list = []
for i in range(1000000):
    my_list.append(i)  # Looks O(1), but sometimes O(n)!

Reality: append is amortized O(1). When the backing array fills up, CPython reallocates a larger one and copies every element, making that particular call O(n).
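
A minimal sketch that makes the leak visible: sys.getsizeof shows the list's backing array growing in jumps as CPython over-allocates (the exact growth pattern is an implementation detail):

import sys

my_list = []
prev = sys.getsizeof(my_list)
for i in range(64):
    my_list.append(i)
    size = sys.getsizeof(my_list)
    if size != prev:
        # A jump in size means the backing array was reallocated and copied
        print(f"len={len(my_list):3d}: {prev} -> {size} bytes")
        prev = size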


Part 2: Industry Context - Layers Everywhere

Web Development: 7 Layers Deep

A typical web request:

flowchart TD
    A["JavaScript (fetch API)"] --> B["Browser (HTTP engine)"]
    B --> C["Operating System (TCP/IP stack)"]
    C --> D["Network Driver (Ethernet)"]
    D --> E["Physical Network (cables, switches)"]
    E --> F["Server OS (TCP/IP stack)"]
    F --> G["Application Server (Node.js, Python)"]
    style A fill:#ff6b6b,stroke:#fff,color:#fff
    style G fill:#3293b2,stroke:#fff,color:#fff

Each layer adds ~1-10ms latency. Understanding the stack helps optimize.
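
A minimal sketch of measuring the whole stack from the top (assumes the requests package is installed; example.com is used purely for illustration):

import time
import requests

start = time.time()
resp = requests.get("https://example.com")
elapsed_ms = (time.time() - start) * 1000

# The measured time includes every layer below the call:
# DNS, TCP handshake, TLS, the server's own stack, and the trip back
print(f"status {resp.status_code} in {elapsed_ms:.1f} ms")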

Deep Learning: The AI Framework Stack

flowchart TD
    A["PyTorch (Python API)"] --> B["TorchScript (Graph representation)"]
    B --> C["ONNX (Intermediate representation)"]
    C --> D["TensorRT/XLA (Optimizing compiler)"]
    D --> E["CUDA/ROCm (GPU programming)"]
    E --> F["GPU Assembly (PTX/AMDGPU)"]
    F --> G["Silicon (NVIDIA A100, AMD MI300)"]
    style A fill:#ff6b6b,stroke:#fff,color:#fff
    style G fill:#499c8d,stroke:#fff,color:#fff

Each layer lowers the program into a new representation and applies its own optimizations before handing it to the layer below.

Example: Flash Attention bypassed several layers → 10x speedup

Databases: Query to Disk I/O

SELECT AVG(price) FROM products WHERE category = 'electronics';

What happens:

  1. SQL parser: Text → AST (abstract syntax tree)
  2. Query planner: AST → Execution plan (which indexes to use)
  3. Execution engine: Plan → Row-by-row operations
  4. Buffer pool: Cache pages in memory
  5. Storage engine: Read from disk

Understanding the stack helps explain why a query is slow: the planner may be skipping an index, the buffer pool may be cold, or the storage engine may be forced into a full-table scan.
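
One way to peek a layer down is to ask the planner for its plan. A minimal sketch with SQLite (standard-library sqlite3; the table and index are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, category TEXT, price REAL)")
conn.execute("CREATE INDEX idx_category ON products(category)")

# Layer 2 made visible: the execution plan the query planner chose
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT AVG(price) FROM products WHERE category = 'electronics'"
).fetchall()
for row in plan:
    print(row)  # e.g. SEARCH products USING INDEX idx_category (category=?)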


Part 3: On Tenstorrent Hardware - Multiple Paths to Silicon

Path 1: Direct RISC-V Programming (Lowest Level)

You write:

// Kernel code
void kernel_main() {
    uint32_t a = *(uint32_t*)0x1000;
    uint32_t b = *(uint32_t*)0x1004;
    uint32_t c = a + b;
    *(uint32_t*)0x1008 = c;
}

Compiled to RISC-V:

li   a0, 0x1000    # Base address of the inputs
lw   t0, 0(a0)     # Load a
lw   t1, 4(a0)     # Load b
add  t2, t0, t1    # Add
sw   t2, 8(a0)     # Store c

Advantages:

  - Complete control over every instruction and data movement
  - Best possible performance, if you are an expert

Disadvantages:

  - Hard to write and debug
  - Not portable: tied to this specific hardware
  - Development takes days to weeks

Path 2: TTNN (Mid Level - TT Neural Network Library)

You write:

import ttnn

# High-level operations
A = ttnn.from_torch(torch_tensor_A)
B = ttnn.from_torch(torch_tensor_B)
C = ttnn.matmul(A, B)  # Matrix multiply on Tenstorrent

What happens:

  1. TTNN decomposes matmul into suboperations
  2. Tiles data for L1 SRAM (blocking; see the sketch after this list)
  3. Generates RISC-V kernels for each tile
  4. Launches parallel kernels across cores
  5. Collects results
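
To make step 2 concrete, here is a minimal NumPy sketch of blocking/tiling. It is not the TTNN implementation, just the idea: compute C one tile at a time so each sub-block fits in fast local memory.

import numpy as np

def blocked_matmul(A, B, tile=32):
    """Compute C tile by tile, the way a kernel stages sub-blocks
    of A and B into fast local memory (L1 SRAM) before multiplying."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                C[i0:i0 + tile, j0:j0 + tile] += (
                    A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
                )
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(blocked_matmul(A, B), A @ B)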

Advantages:

  - Familiar tensor-level API; tiling, kernel generation, and parallel launch are handled for you
  - Very good performance with hours-to-days of development time

Disadvantages:

  - Runs only on Tenstorrent hardware
  - Less control than hand-written kernels

Path 3: TT-XLA (High Level - XLA Compiler)

You write:

import jax
import jax.numpy as jnp

# Pure JAX code
def my_model(x):
    return jnp.dot(x, W) + b

# JIT compile to Tenstorrent
model_tt = jax.jit(my_model, backend='tt')
result = model_tt(input_data)

What happens:

  1. JAX traces your Python function
  2. Builds HLO (High-Level Operations) graph
  3. XLA compiler optimizes graph (fusion, layout)
  4. TT-XLA backend generates tt-metal code
  5. Executes on hardware

Advantages:

  - Plain JAX code: easiest to write and the most portable across backends
  - Development time measured in minutes to hours

Disadvantages:

  - Least control over what the compiler generates
  - Performance is good, but typically below hand-tuned TTNN or custom kernels
  - The first call pays a compilation cost

Comparing the Paths

Feature            Direct RISC-V       TTNN          TT-XLA
Ease of use        Hard                Medium        Easy
Control            Complete            Medium        Low
Performance        Best (if expert)    Very good     Good
Portability        None                TT only       Multi-platform
Development time   Days-weeks          Hours-days    Minutes-hours

Recommendation: start with TT-XLA or TTNN for most work, and drop down to direct RISC-V kernels only for hot paths where complete control pays off.


Part 4: Hands-On - Comparing Abstraction Levels

Experiment: Matrix Multiply at 3 Levels

Problem: Multiply two 128×128 matrices

Level 1: Pure Python (Slowest)

# Naive triple-nested loop
import random
import time

A = [[random.random() for _ in range(128)] for _ in range(128)]
B = [[random.random() for _ in range(128)] for _ in range(128)]
C = [[0.0 for _ in range(128)] for _ in range(128)]

start = time.time()
for i in range(128):
    for j in range(128):
        for k in range(128):
            C[i][j] += A[i][k] * B[k][j]
elapsed = time.time() - start

print(f"Pure Python: {elapsed:.3f} seconds")

Expected: ~5-10 seconds (Python interpreter overhead)

Level 2: NumPy (Faster)

# NumPy vectorized operation
import numpy as np
import time

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)

start = time.time()
C = np.matmul(A, B)
elapsed = time.time() - start

print(f"NumPy: {elapsed:.3f} seconds")

Expected: ~0.001-0.01 seconds (C implementation, BLAS library)

Speedup: 500-5000x faster!

Why? NumPy calls optimized C/Fortran BLAS libraries (OpenBLAS, Intel MKL).
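
You can check which BLAS implementation your own NumPy build is linked against:

import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build was compiled against
np.show_config()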

Level 3: TTNN on Tenstorrent (Fastest on TT Hardware)

# TTNN (Tenstorrent Neural Network library)
import ttnn
import torch
import time

device = ttnn.open_device(device_id=0)

# Create tensors on Tenstorrent
A_torch = torch.rand(128, 128)
B_torch = torch.rand(128, 128)

# ttnn.matmul expects tensors in tile layout
A_tt = ttnn.from_torch(A_torch, layout=ttnn.TILE_LAYOUT, device=device)
B_tt = ttnn.from_torch(B_torch, layout=ttnn.TILE_LAYOUT, device=device)

start = time.time()
C_tt = ttnn.matmul(A_tt, B_tt)
C_torch = ttnn.to_torch(C_tt)
elapsed = time.time() - start

print(f"TTNN: {elapsed:.3f} seconds")

Expected: ~0.0001-0.001 seconds (parallel execution on Tensix cores)

Speedup over NumPy: 10-100x (depending on hardware)

Analyzing the Performance Gap

Pure Python:     10.000 seconds  (baseline)
NumPy:            0.010 seconds  (1000x faster)
TTNN:             0.001 seconds  (10,000x faster)

Why the differences?

Pure Python:
- Interpreted (not compiled)
- Dynamic typing (checks every operation)
- Python objects (allocation overhead)
- No vectorization

NumPy:
- Compiled C code (no interpretation)
- Statically typed internally
- Contiguous memory (cache-friendly)
- BLAS library (optimized for CPU)

TTNN:
- Parallel execution (176 cores)
- Near-memory compute (L1 SRAM)
- Optimized for matrix ops
- Hardware acceleration

Part 5: The Compilation Pipeline

Step-by-Step: Python to RISC-V

Source code (Python):

def add(a, b):
    return a + b

Step 1: Parsing (Text → AST)

FunctionDef: add
  Parameters: a, b
  Body:
    Return:
      BinOp: +
        Left: a
        Right: b
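
You can reproduce this step with the standard-library ast module; a minimal sketch:

import ast

source = "def add(a, b):\n    return a + b"
tree = ast.parse(source)

# Prints the FunctionDef / Return / BinOp structure shown above
print(ast.dump(tree, indent=2))  # indent= requires Python 3.9+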

Step 2: Bytecode Compilation (AST → Python bytecode)

LOAD_FAST    a
LOAD_FAST    b
BINARY_ADD
RETURN_VALUE

Step 3: Interpretation (Bytecode → C function calls)

PyObject* result = PyNumber_Add(a, b);

For TTNN (C++ → RISC-V):

Source code (C++):

uint32_t c = a + b;

Step 1: Preprocessing (Handle #include, #define)

// Expanded macros, included headers
uint32_t c = a + b;

Step 2: Compilation (C++ → Assembly)

lw   t0, 0(a0)
lw   t1, 0(a1)
add  t2, t0, t1
sw   t2, 0(a2)

Step 3: Assembly (Assembly → Machine code)

0x00052283
0x0005A303
0x006283B3
0x00762023

Step 4: Linking (Combine object files + libraries)

[Final binary with all functions linked]

Step 5: Execution (Hardware runs machine code)

JIT Compilation: Best of Both Worlds?

Traditional compilation:

Source code → Compile (minutes) → Binary → Run (fast)

Interpretation:

Source code → Run (slow, no compilation)

JIT (Just-In-Time) Compilation:

Source code → Interpret (slow first time)
               ↓
             Profile execution (find hot code)
               ↓
             Compile hot code (one-time cost)
               ↓
             Run compiled code (fast!)

Examples: the Java HotSpot JVM, JavaScript engines such as V8, PyPy, and JAX/XLA all rely on JIT compilation.

Tenstorrent TT-XLA uses JIT:

import jax
import jax.numpy as jnp

W = jnp.ones((128, 128))   # example weights
x = jnp.ones((128, 128))   # example input

@jax.jit  # JIT compile decorator
def my_function(x):
    return jnp.dot(x, W)

# First call: slow (traces and compiles)
result = my_function(x)  # ~1 second

# Subsequent calls: fast (cached compiled code)
result = my_function(x)  # ~0.001 seconds

Part 6: When Abstractions Help (and Hurt)

Case Study 1: Python List vs NumPy Array

Python list (flexible but slow):

my_list = [1, 2, 3, 4, 5]
# Each element is a PyObject pointer
# Memory layout: [ptr1, ptr2, ptr3, ptr4, ptr5]
#                  ↓     ↓     ↓     ↓     ↓
#                 [1]   [2]   [3]   [4]   [5]
# Non-contiguous, cache-unfriendly

NumPy array (fast but rigid):

my_array = np.array([1, 2, 3, 4, 5])
# Elements are directly in memory
# Memory layout: [1, 2, 3, 4, 5]
# Contiguous, cache-friendly

Tradeoff: a Python list can hold anything but pays pointer-chasing and per-object overhead; a NumPy array requires a single dtype but gets contiguous, cache-friendly storage.
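
A minimal sketch that makes the memory difference concrete:

import sys
import numpy as np

py_list = list(range(1000))
np_array = np.arange(1000, dtype=np.int64)

# The list stores 1000 pointers to separate int objects;
# the array stores 1000 packed 8-byte integers in one contiguous buffer.
print("list container:", sys.getsizeof(py_list), "bytes (element objects not counted)")
print("numpy data:    ", np_array.nbytes, "bytes, contiguous")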

Case Study 2: TensorFlow Eager vs Graph Mode

Eager mode (TF 2.x default):

import tensorflow as tf

def my_model(x):          # No @tf.function: every op runs eagerly
    return tf.matmul(x, W)

# Runs the Python interpreter for every operation
result = my_model(input)  # Slow

Graph mode (TF 1.x, or with @tf.function):

@tf.function  # JIT compiled!
def my_model(x):
    return tf.matmul(x, W)

# First call: Build graph, compile
result = my_model(input)  # Slow (~1 second)

# Subsequent calls: Use compiled graph
result = my_model(input)  # Fast (~0.001 seconds)

Tradeoff: eager mode is easy to write and debug (it is ordinary Python), while graph mode pays a one-time compilation cost and then runs much faster on every subsequent call.

Case Study 3: Flash Attention (Bypassing Abstractions)

Standard attention (via framework):

# PyTorch
scores = torch.matmul(Q, K.T)
attention = torch.softmax(scores, dim=-1)  # softmax needs an explicit dim
output = torch.matmul(attention, V)

Problem: Each operation allocates intermediate tensors → memory bottleneck
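
Back-of-the-envelope arithmetic for that intermediate tensor (illustrative numbers, not from any specific model):

# Size of the full 'scores' matrix that standard attention materializes
seq_len = 8192       # tokens
num_heads = 32
bytes_per_elem = 2   # fp16

scores_bytes = seq_len * seq_len * bytes_per_elem * num_heads
print(f"{scores_bytes / 1e9:.1f} GB just for the scores")  # ~4.3 GB per layer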

Flash Attention (fused kernel):

// Custom CUDA kernel (bypasses PyTorch abstractions)
// Fuses matmul + softmax + matmul into one kernel
// Never materializes full scores matrix
flash_attention_kernel(Q, K, V, output);

Result: 10x faster by going UNDER the abstraction

Lesson: Sometimes you need to drop down a level for performance.


Part 7: Discussion Questions

Question 1: Is High-Level Code Always Slower?

Q: Python is 100x slower than C. Should we write everything in C?

A: No! Development speed matters.

Cost-benefit analysis:

Project: Web scraper

Python:
  - Development time: 2 hours
  - Runtime: 10 seconds per page
  - Total time (1000 pages): 2 hours dev + 2.8 hours run = 4.8 hours

C:
  - Development time: 20 hours (debugging, memory management)
  - Runtime: 0.1 seconds per page
  - Total time (1000 pages): 20 hours dev + 0.03 hours run = 20.03 hours

Python wins! (For this use case)
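
A quick sketch of the break-even arithmetic behind that conclusion (same illustrative numbers as above):

pages = 1000

python_hours = 2 + pages * 10 / 3600     # dev time + runtime
c_hours = 20 + pages * 0.1 / 3600

print(f"Python: {python_hours:.2f} h,  C: {c_hours:.2f} h")

# C only wins once runtime dominates development time:
# 2 + n*10/3600 > 20 + n*0.1/3600  =>  n > ~6,500 pages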

Use high-level languages when development time dominates: prototypes, scripts, glue code, and anything that runs rarely relative to how long it takes to write.

Use low-level languages when runtime dominates: hot loops executed millions of times, latency-critical services, and kernels where hardware control matters.

Question 2: What About "Zero-Cost Abstractions"?

Q: Rust promises "zero-cost abstractions." Is that possible?

A: Yes, but with caveats.

Example: Rust iterators

// High-level
let sum: i32 = (0..100).map(|x| x * 2).sum();

// Compiles to same assembly as:
let mut sum = 0;
for x in 0..100 {
    sum += x * 2;
}

The abstraction (iterator) has zero runtime cost - compiler optimizes it away!

But: the zero cost is at runtime only. You still pay in compile time and in the effort of learning the abstractions, and the guarantee holds only when the compiler can actually see through them.

Tenstorrent TTNN aims for this: High-level API, but compiles to optimal low-level kernels.

Question 3: Should We Avoid Abstractions?

Q: Given that abstractions can hurt performance, should we avoid them?

A: No! Abstractions are necessary for complexity management.

Without abstractions:

Every programmer writes machine code
  → 100x slower development
  → More bugs (memory safety, concurrency)
  → No code reuse
  → Can't build complex systems

With abstractions:

Programmers use frameworks/libraries
  → 100x faster development
  → Fewer bugs (framework handles complexity)
  → Code reuse (DRY principle)
  → Can build complex systems (OSes, browsers, AI)

The solution: Use appropriate level of abstraction


Part 8: Real-World Example - AI Model Deployment

Scenario: Deploy a language model for production inference

Option 1: Pure PyTorch (Highest Level)

import torch

model = torch.load("model.pth")
model.eval()

def infer(input_text):
    tokens = tokenize(input_text)
    with torch.no_grad():
        output = model(tokens)
    return decode(output)

Pros:

  - Simplest to write and debug; plain Python all the way down
  - No extra tooling or conversion step

Cons:

  - Python and framework overhead on every request
  - No request batching or serving optimizations
  - Slowest option in the comparison below (~1 req/s)

Option 2: TorchScript (Mid Level)

import torch

model = torch.jit.load("model.torchscript")

def infer(input_text):
    tokens = tokenize(input_text)
    output = model(tokens)
    return decode(output)

Pros:

  - The model is a compiled, serialized graph, so per-call overhead drops (~5x in the comparison below)
  - Can be deployed without the original Python model code

Cons:

  - Not all Python code can be traced or scripted
  - Still no serving-level optimizations such as request batching

Option 3: vLLM on Tenstorrent (Low Level, Optimized)

# vLLM server (started separately)
# python -m vllm.entrypoints.api_server --model Llama-3.1-8B-Instruct

# Client code:
import requests

response = requests.post("http://localhost:8000/generate", json={
    "prompt": input_text,
    "max_tokens": 100
})

Pros:

  - Highest throughput (~100 req/s in the comparison below)
  - Batching, scheduling, and KV-cache management are handled by the server

Cons:

  - A separate server to deploy and operate
  - Less flexibility than calling the model directly from Python

Performance comparison:

Pure PyTorch:   1 req/s    (baseline)
TorchScript:    5 req/s    (5x faster)
vLLM:         100 req/s    (100x faster!)

Why the difference? vLLM batches requests together, manages the KV cache efficiently, and runs optimized kernels, while pure PyTorch pays interpreter and allocation overhead on every individual request.


Part 9: Key Takeaways

After this module, you should understand:

  - The layers between result = torch.matmul(A, B) and the silicon that executes it
  - Why every layer trades convenience against control and performance
  - The three paths onto Tenstorrent hardware (direct RISC-V, TTNN, TT-XLA) and when to use each
  - Why dropping down a layer, as Flash Attention did, can unlock large speedups

The Core Insight

Abstractions are tools, not rules.

The best programmers:

  - Know roughly what each abstraction is hiding
  - Work at the highest level that meets the performance target
  - Drop down a layer only when profiling shows it matters


Part 10: Preview of Module 7 - Computational Complexity in Practice

We've explored the abstraction stack. But all these layers still execute algorithms. And algorithms have complexity.

Teaser questions:

  1. Big-O vs Reality: Why is O(n log n) sometimes slower than O(n²)?
  2. Constants matter: How did Flash Attention achieve a 10x speedup in practice without changing the O(n²) compute?
  3. Hardware co-design: Can you change the algorithm to match the hardware?

Module 7 brings everything together: algorithms, hardware, and real performance.



Summary

We explored:

  - The abstraction stack from Python through C, assembly, and machine code to silicon
  - Leaky abstractions, and why the hidden details sometimes matter
  - Abstraction stacks in web systems, AI frameworks, and databases
  - Three paths to Tenstorrent hardware and the compilation pipeline beneath them
  - When bypassing an abstraction (Flash Attention, vLLM) pays off

Key lesson: Use the right abstraction level for the task. High-level for development, low-level for optimization.

Next: We explore how algorithms perform in practice on real hardware.

→ Continue to Module 7: Computational Complexity in Practice