N150 N300 T3K P100 P150 P300C Galaxy 30 min Draft

Module 1: RISC-V & Computer Architecture

Introduction: From Abstraction to Silicon Through RISC-V

As an industry professional, you've written countless lines of code. You understand APIs, frameworks, databases, and distributed systems. But what actually happens when you write x = a + b?

This series takes you from CS theory to real silicon. Not in a textbook, not in a simulator, but on hardware you can touch - with 880 RISC-V processors at your command.

What You'll Learn

By the end of this module, you'll understand:

✅ Von Neumann Architecture - The foundation of every computer since 1945
✅ Fetch-Decode-Execute Cycle - What CPUs actually do, billions of times per second
✅ RISC-V ISA - The instruction set that runs the world (and this chip)
✅ One Core Deeply - Master one processor before scaling to 880

Philosophy: Understanding 880 cores starts with understanding ONE core completely.

Part 1: CS Theory - What is a Computer?

The Von Neumann Architecture (1945)

Every modern computer - your laptop, your phone, your GPU, this Tenstorrent chip - follows the same fundamental architecture proposed by John von Neumann in 1945:

flowchart TD
    MEM["Memory(Unified: Instructions + Data)"]
    CPU["Central Processing Unit"]
    CU["Control Unit(Fetch & Decode)"]
    ALU["Arithmetic Logic Unit(Execute)"]
    INPUT["Input(Keyboard, Network, etc.)"]
    OUTPUT["Output(Display, Network, etc.)"]

    INPUT --> MEM
    MEM <--> CPU
    CPU --> CU
    CPU --> ALU
    CU -.->|Control Signals| ALU
    MEM --> OUTPUT

    style MEM fill:#3293b2,stroke:#fff,color:#fff
    style CPU fill:#5347a4,stroke:#fff,color:#fff
    style CU fill:#499c8d,stroke:#fff,color:#fff
    style ALU fill:#499c8d,stroke:#fff,color:#fff

Key Insight: Instructions and data live in the same memory. The program is just data that tells the computer what to do.

The Fetch-Decode-Execute Cycle

Every processor, whether it's your laptop's Intel Core or this Tenstorrent BRISC, does the same thing repeatedly:

stateDiagram-v2
    [*] --> Fetch
    Fetch --> Decode : Instruction Ready
    Decode --> Execute : Instruction Decoded
    Execute --> WriteBack : Result Ready
    WriteBack --> Fetch : Next Instruction

    note right of Fetch
        Read instruction from memory
        Increment program counter (PC)
    end note

    note right of Decode
        Interpret instruction bits
        Identify operation and operands
    end note

    note right of Execute
        Perform the operation
        (add, load, store, etc.)
    end note

    note right of WriteBack
        Write result to register/memory
        Update processor state
    end note

This cycle runs billions of times per second. Understanding it deeply is the key to performance optimization.
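To make the cycle concrete, here is a minimal sketch of a toy machine that runs the loop in software. It is not a model of BRISC: `ToyCpu`, `run`, `add_14_and_7`, and the `kHalt` marker are names invented for this illustration, and only two real RV32I encodings (ADDI and ADD) are handled. Instructions and data sit in the same array, which is the Von Neumann idea from the previous section.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model, not BRISC: one unified word-addressed memory holds code and data.
struct ToyCpu {
    std::array<uint32_t, 32> x{};  // register file (x[0] stays 0)
    uint32_t pc = 0;               // byte address of the next instruction
};

constexpr uint32_t kHalt = 0;  // made-up stop marker, not a real RISC-V instruction

// Runs until kHalt or an unsupported instruction is fetched.
// Understands two real RV32I encodings: ADDI (opcode 0x13) and ADD (opcode 0x33).
template <std::size_t N>
void run(ToyCpu& cpu, const std::array<uint32_t, N>& mem) {
    while (true) {
        // Fetch: read the word at PC, then advance PC by one instruction.
        assert(cpu.pc / 4 < N);
        const uint32_t inst = mem[cpu.pc / 4];
        cpu.pc += 4;
        if (inst == kHalt) return;

        // Decode: slice the fixed-position bit fields.
        const uint32_t opcode = inst & 0x7F;
        const uint32_t rd     = (inst >> 7) & 0x1F;
        const uint32_t funct3 = (inst >> 12) & 0x7;
        const uint32_t rs1    = (inst >> 15) & 0x1F;
        const uint32_t rs2    = (inst >> 20) & 0x1F;
        const uint32_t funct7 = inst >> 25;
        uint32_t imm = inst >> 20;            // I-type immediate, bits 31..20
        if (imm & 0x800) imm |= 0xFFFFF000u;  // sign-extend from 12 bits

        // Execute: compute the result for the decoded operation.
        uint32_t result;
        if (opcode == 0x13 && funct3 == 0) {
            result = cpu.x[rs1] + imm;         // ADDI
        } else if (opcode == 0x33 && funct3 == 0 && funct7 == 0) {
            result = cpu.x[rs1] + cpu.x[rs2];  // ADD
        } else {
            return;  // anything else is outside this sketch
        }

        // Write back: x0 is hardwired to zero, so writes to it are dropped.
        if (rd != 0) cpu.x[rd] = result;
    }
}

// The whole program is just four words of memory.
inline uint32_t add_14_and_7() {
    const std::array<uint32_t, 4> mem = {
        0x00E00293u,  // addi t0, zero, 14
        0x00700313u,  // addi t1, zero, 7
        0x006283B3u,  // add  t2, t0, t1
        kHalt,
    };
    ToyCpu cpu;
    run(cpu, mem);
    return cpu.x[7];  // t2 is x7
}
```

Calling `add_14_and_7()` returns 21. Real cores overlap these stages in a pipeline instead of finishing one instruction before starting the next, but the per-instruction steps are the same.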

Turing Completeness

Informally, a machine is Turing complete if it can:

Perform arbitrary arithmetic
Store and retrieve data from memory
Make conditional decisions (if/then)
Loop indefinitely

The BRISC processor we're about to program meets all four requirements. With just 32 registers and a small instruction set, it can compute anything computable, limited only by available memory and time.

Part 2: Why This Matters (Industry Context)

"But I Write Python/Java/Go..."

You might think: "I haven't thought about CPU instructions since my undergrad CS course. Why does this matter now?"

Three reasons:

1. Performance Debugging

When your Python code is slow, the cause is often one of:

Cache misses (memory hierarchy - Module 2)
False sharing (synchronization - Module 5)
Branch misprediction (speculation)

You can't fix what you don't understand.

2. Hardware Acceleration

Modern software runs on GPUs, TPUs, NPUs, and custom accelerators. They all follow the same principles:

Parallel execution (Module 3)
Explicit memory management (Module 2)
Network communication (Module 4)

Tenstorrent hardware makes these principles visible and programmable.

3. The Abstraction is Leaking

Your framework hides the hardware... until it doesn't:

Why is NumPy 100x faster than Python loops?
Why do GPUs need "kernel fusion"?
Why does "data locality" matter in databases?

Understanding the hardware explains the software.

Industry Examples

Google's TPUs: Custom matrix multiplication units with explicit SRAM management (just like what we'll program)

NVIDIA's GPUs: 10,000+ cores with explicit synchronization barriers (just like the NoC barriers we'll use)

AWS Graviton: ARM processors with similar RISC architecture to RISC-V

This lesson teaches you principles that apply everywhere.

Part 3: On Real Hardware - The BRISC Processor

Meet Your 880 Computers

A Tenstorrent Wormhole chip contains 176 Tensix cores. Each Tensix contains 5 RISC-V processors:

graph TD
    subgraph TENSIX["Tensix Core (one of 176)"]
        BRISC["BRISC\n(Data Move · RV32IM)"]
        NCRISC["NCRISC\n(Data Move · RV32IM)"]
        TRISC0["TRISC0\n(Unpack · RV32IM)"]
        TRISC1["TRISC1\n(Math · RV32IM)"]
        TRISC2["TRISC2\n(Pack · RV32IM)"]
        L1["Shared: 1.5 MB L1 SRAM"]
        BRISC & NCRISC & TRISC0 & TRISC1 & TRISC2 --> L1
    end
    style TENSIX fill:#0F2A35,stroke:#4FD1C5,color:#E8F0F2
    style BRISC fill:#1A3C47,stroke:#4FD1C5,color:#4FD1C5
    style NCRISC fill:#1A3C47,stroke:#4FD1C5,color:#4FD1C5
    style TRISC0 fill:#1A3C47,stroke:#81E6D9,color:#81E6D9
    style TRISC1 fill:#1A3C47,stroke:#EC96B8,color:#EC96B8
    style TRISC2 fill:#1A3C47,stroke:#81E6D9,color:#81E6D9
    style L1 fill:#2D3142,stroke:#4FD1C5,color:#E8F0F2

176 Tensix × 5 RISC-V cores = 880 processors

Today, we focus on BRISC (RISCV_0) - the primary data movement processor.

RISC-V ISA: RV32IM

BRISC implements the RV32IM instruction set:

RV32: 32-bit architecture (registers and addresses)
I: Integer base instructions (add, load, store, branch)
M: Multiplication and division

Total instruction set: ~50 instructions. (Compare to x86's 1000+ instructions!)

RISC Philosophy: Simple instructions, executed fast. Complexity goes in the compiler, not the hardware.

The BRISC Programmer's Model

When you write a BRISC kernel, you have:

32 General-Purpose Registers:

x0  (zero)  - Always 0 (hardware enforced)
x1  (ra)    - Return address
x2  (sp)    - Stack pointer
x3  (gp)    - Global pointer
x4  (tp)    - Thread pointer
x5-x7   (t0-t2)   - Temporaries
x8-x9   (s0-s1)   - Saved registers
x10-x17 (a0-a7)   - Arguments/return values
x18-x27 (s2-s11)  - Saved registers
x28-x31 (t3-t6)   - Temporaries

Program Counter (PC):

Points to the next instruction to execute
Incremented after each fetch (PC += 4)
Modified by branches/jumps

Memory:

L1 SRAM (1.5 MB) - Directly addressable, shared with other cores
DRAM (1 GB+) - Accessed via NoC DMA only

Part 4: Hands-On - Run Your First RISC-V Program

Let's run the canonical first program: adding two integers.

Step 1: Build Programming Examples

First, build tt-metal with RISC-V programming examples enabled:

cd ~/tt-metal && \
  ./build_metal.sh --build-programming-examples

This takes 5-10 minutes. The build system:

Compiles host C++ code (x86/ARM)
Compiles RISC-V kernels (using riscv32-gcc cross-compiler)
Links everything together

Step 2: Run the Addition Example

Now run the example:

cd ~/tt-metal && \
  export TT_METAL_DPRINT_CORES=0,0 && \
  ./build_Release/programming_examples/metal_example_add_2_integers_in_riscv

Expected output:

Success: Result is 21
0:(x=0,y=0):BR: Adding integers: 14 + 7

Note: You'll see device initialization logs and possibly firmware version warnings before the output. These are normal - the system initializes all detected hardware before running your kernel. On multi-device systems (like QuietBox), you'll see initialization for all cards even though only device 0 is used.

🎉 Congratulations! You just ran a program on a RISC-V processor.

Step 3: Understand What Happened

Let's trace the execution step by step:

sequenceDiagram
    participant H as Host CPU(Your x86/ARM)
    participant D as DRAM(1 GB)
    participant B as BRISC(RISC-V Core 0,0)
    participant L as L1 SRAM(1.5 MB)

    H->>D: Write 14 to DRAM offset 0x0
    H->>D: Write 7 to DRAM offset 0x4
    H->>B: Start kernel with args
    Note over B: Kernel running on BRISC...
    B->>D: NoC DMA read (14)
    D-->>L: Transfer to L1 @ 0x1000
    B->>D: NoC DMA read (7)
    D-->>L: Transfer to L1 @ 0x1004
    Note over B: Wait for DMA completion
    B->>L: lw t0, 0x1000  (load 14)
    B->>L: lw t1, 0x1004  (load 7)
    Note over B: add t2, t0, t1
    Note over B: (14 + 7 = 21)
    B->>L: sw t2, 0x1008  (store 21)
    B->>D: NoC DMA write (21)
    L-->>D: Transfer from L1
    Note over B: Wait for DMA completion
    B->>H: Signal kernel complete
    H->>D: Read result (21)

Key Points:

Host controls setup - Writes input data, launches kernel
BRISC does the work - Fetches data, computes, writes back
NoC DMA moves data - Between DRAM and L1 SRAM
Explicit synchronization - Barriers wait for DMA completion

This is bare-metal programming. No OS, no abstractions, complete control.

Part 5: Dive Deeper - The RISC-V Kernel

Let's examine the actual kernel code:

File: ~/tt-metal/tt_metal/programming_examples/add_2_integers_in_riscv/kernels/reader_writer_add_in_riscv.cpp

Runtime Arguments

The host passes arguments to the kernel:

void kernel_main() {
    // Host tells us where DRAM buffers are
    uint32_t src0_dram = get_arg_val<uint32_t>(0);  // Address of first integer
    uint32_t src1_dram = get_arg_val<uint32_t>(1);  // Address of second integer
    uint32_t dst_dram  = get_arg_val<uint32_t>(2);  // Where to write result

    // Host tells us where to use L1 SRAM
    uint32_t src0_l1   = get_arg_val<uint32_t>(3);  // L1 buffer for first int
    uint32_t src1_l1   = get_arg_val<uint32_t>(4);  // L1 buffer for second int
    uint32_t dst_l1    = get_arg_val<uint32_t>(5);  // L1 buffer for result

Why separate DRAM and L1 addresses?

DRAM: Large (1 GB+), slow access (~200 cycles via NoC)
L1 SRAM: Small (1.5 MB), fast access (~1 cycle direct)
Pattern: DMA from DRAM to L1, compute on L1, DMA back
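The stage-in, compute, stage-out pattern can be sketched on the host with two plain byte arrays. This is only a model under stated assumptions: `MemoryModel`, `load_u32`, `store_u32`, and `add_via_l1` are hypothetical names, the copies stand in for NoC DMA, and nothing here calls the tt-metal API. The argument order mirrors the six runtime arguments above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side model only: two byte arrays stand in for DRAM and one core's L1.
struct MemoryModel {
    std::vector<uint8_t> dram;
    std::vector<uint8_t> l1;
};

inline uint32_t load_u32(const std::vector<uint8_t>& mem, uint32_t addr) {
    assert(addr + sizeof(uint32_t) <= mem.size());
    uint32_t value;
    std::memcpy(&value, mem.data() + addr, sizeof(value));
    return value;
}

inline void store_u32(std::vector<uint8_t>& mem, uint32_t addr, uint32_t value) {
    assert(addr + sizeof(uint32_t) <= mem.size());
    std::memcpy(mem.data() + addr, &value, sizeof(value));
}

// Stage in, compute on L1, stage out. The copies play the role of NoC DMA.
inline void add_via_l1(MemoryModel& m,
                       uint32_t src0_dram, uint32_t src1_dram, uint32_t dst_dram,
                       uint32_t src0_l1, uint32_t src1_l1, uint32_t dst_l1) {
    store_u32(m.l1, src0_l1, load_u32(m.dram, src0_dram));  // DRAM -> L1
    store_u32(m.l1, src1_l1, load_u32(m.dram, src1_dram));  // DRAM -> L1

    // The core only ever touches L1 for the arithmetic itself.
    const uint32_t sum = load_u32(m.l1, src0_l1) + load_u32(m.l1, src1_l1);
    store_u32(m.l1, dst_l1, sum);

    store_u32(m.dram, dst_dram, load_u32(m.l1, dst_l1));    // L1 -> DRAM
}
```

With 14 at DRAM offset 0x0 and 7 at offset 0x4, `add_via_l1(m, 0x0, 0x4, 0x8, 0x1000, 0x1004, 0x1008)` leaves 21 at DRAM offset 0x8, matching the trace in Step 3.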

NoC DMA Operations

Moving data from DRAM to L1:

    // Calculate NoC addresses (include X,Y coordinates)
    uint64_t src0_dram_noc_addr = get_noc_addr(0, src0_dram);
    uint64_t src1_dram_noc_addr = get_noc_addr(0, src1_dram);

    // Asynchronous DMA: Start transfers in parallel
    noc_async_read(src0_dram_noc_addr, src0_l1, sizeof(uint32_t));
    noc_async_read(src1_dram_noc_addr, src1_l1, sizeof(uint32_t));

    // Wait for both transfers to complete
    noc_async_read_barrier();

Key concept: Asynchronous DMA

noc_async_read() starts the transfer and returns immediately
BRISC can do other work while DMA happens
noc_async_read_barrier() blocks until all transfers finish

This is explicit parallelism - DMA happens on dedicated hardware while the processor continues.
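The issue-then-barrier contract can be illustrated with a small deterministic model. `MockNoc` and its members are hypothetical and are not the tt-metal API. The model is deliberately pessimistic: data lands in L1 only when the barrier runs, whereas real hardware may finish earlier. The rule it shows still holds on real hardware: L1 contents are only guaranteed after the barrier returns.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model, not the tt-metal API: a read is only recorded when issued,
// and the data is guaranteed to be in L1 only after the barrier returns.
struct MockNoc {
    struct Transfer {
        std::size_t dram_index;
        std::size_t l1_index;
    };

    std::vector<uint32_t> dram;
    std::vector<uint32_t> l1;
    std::vector<Transfer> in_flight;

    // Returns immediately; nothing has been copied yet in this pessimistic model.
    void async_read(std::size_t dram_index, std::size_t l1_index) {
        assert(dram_index < dram.size() && l1_index < l1.size());
        in_flight.push_back(Transfer{dram_index, l1_index});
    }

    // Completes every outstanding transfer before returning.
    void read_barrier() {
        for (const Transfer& t : in_flight) {
            l1[t.l1_index] = dram[t.dram_index];
        }
        in_flight.clear();
    }
};
```

If you issue two reads and sum the L1 words before calling `read_barrier()`, the model yields the stale values (0 here). After the barrier the sum is 21. This is the failure mode a missing barrier exposes.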

The RISC-V Addition

Now the actual computation:

    // Cast L1 addresses to pointers
    uint32_t* dat0 = (uint32_t*)src0_l1;  // Points to L1 SRAM
    uint32_t* dat1 = (uint32_t*)src1_l1;
    uint32_t* out0 = (uint32_t*)dst_l1;

    // This C++ code compiles to RISC-V assembly:
    //   lw   t0, 0(a0)     # Load *dat0 into register t0
    //   lw   t1, 0(a1)     # Load *dat1 into register t1
    //   add  t2, t0, t1    # Add t0 + t1, result in t2
    //   sw   t2, 0(a2)     # Store t2 to *out0

    (*out0) = (*dat0) + (*dat1);

    // Debug print (visible with TT_METAL_DPRINT_CORES=0,0)
    DPRINT << "Adding integers: " << *dat0 << " + " << *dat1 << "\n";

From C++ to RISC-V:

The expression (*out0) = (*dat0) + (*dat1) becomes:

# RISC-V Assembly (what actually executes)
lw   t0, 0(a0)      # Load word from address in a0 → t0
lw   t1, 0(a1)      # Load word from address in a1 → t1
add  t2, t0, t1     # Add t0 and t1 → t2
sw   t2, 0(a2)      # Store word from t2 → address in a2

This is the fetch-decode-execute cycle in action:

Fetch: Read lw t0, 0(a0) from instruction memory
Decode: "Load word, source=memory[a0], dest=t0"
Execute: Read from L1 SRAM at address in a0
Write Back: Write value to register t0
Repeat for next instruction
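The four stages for that single `lw` can be written out as one function. This is a sketch under assumptions: `Core` and `step_lw` are illustrative names, memory is modeled as word arrays, and only the LW encoding (opcode 0x03, funct3 0x2) is recognized. The word 0x00052283 is the standard RV32I encoding of `lw t0, 0(a0)`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Core {
    uint32_t x[32] = {};  // registers; a0 is x10, t0 is x5
    uint32_t pc = 0;
};

// Performs one fetch/decode/execute/write-back step, for LW only.
// Returns false when the fetched word is not an LW instruction.
inline bool step_lw(Core& core,
                    const std::vector<uint32_t>& imem,  // instruction words
                    const std::vector<uint32_t>& l1) {  // data words, stands in for L1
    // Fetch
    assert(core.pc / 4 < imem.size());
    const uint32_t inst = imem[core.pc / 4];
    core.pc += 4;

    // Decode (I-type layout)
    const uint32_t opcode = inst & 0x7F;
    const uint32_t rd     = (inst >> 7) & 0x1F;
    const uint32_t funct3 = (inst >> 12) & 0x7;
    const uint32_t rs1    = (inst >> 15) & 0x1F;
    uint32_t imm = inst >> 20;
    if (imm & 0x800) imm |= 0xFFFFF000u;  // sign-extend the 12-bit offset
    if (opcode != 0x03 || funct3 != 0x2) return false;

    // Execute: form the address from base register plus offset, then read memory.
    const uint32_t addr = core.x[rs1] + imm;
    assert(addr % 4 == 0 && addr / 4 < l1.size());
    const uint32_t value = l1[addr / 4];

    // Write back
    if (rd != 0) core.x[rd] = value;
    return true;
}
```

With `a0` holding 0x1000 and the value 14 stored at that L1 address, one call to `step_lw` leaves 14 in `t0` and advances the PC by 4.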

Writing Result Back to DRAM

Finally, DMA the result back:

    // DMA write from L1 to DRAM
    uint64_t dst_dram_noc_addr = get_noc_addr(0, dst_dram);
    noc_async_write(dst_l1, dst_dram_noc_addr, sizeof(uint32_t));

    // Wait for write to complete before kernel exits
    noc_async_write_barrier();
}

Why the barrier at the end?

If the kernel exits before DMA completes, the result might not be written
Explicit synchronization ensures correctness
Compare to x86 where writes "just work" (cache coherence hardware handles it)

Part 6: Experiments and Discussion

Experiment 1: Change the Operation

Modify the kernel to multiply instead of add:

// Change this line:
(*out0) = (*dat0) + (*dat1);

// To this:
(*out0) = (*dat0) * (*dat1);

Rebuild and run:

cd ~/tt-metal
./build_metal.sh --build-programming-examples
./build_Release/programming_examples/metal_example_add_2_integers_in_riscv

Question: How does the RISC-V assembly change?

Addition: add t2, t0, t1 (one instruction, typically one cycle)
Multiplication: mul t2, t0, t1 (still one instruction, but often several cycles; the exact latency depends on the core's multiplier, not on the RISC-V ISA)
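At the encoding level, the two instructions differ in a single field. The sketch below builds both 32-bit words from the standard R-type layout. The helper names (`encode_r_type`, `encode_add`, `encode_mul`) are made up for this example, while the field values come from the RV32I base and the M extension.

```cpp
#include <cassert>
#include <cstdint>

// R-type layout, high bits to low: funct7 | rs2 | rs1 | funct3 | rd | opcode
inline uint32_t encode_r_type(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                              uint32_t funct3, uint32_t rd, uint32_t opcode) {
    assert(funct7 < 128 && rs2 < 32 && rs1 < 32 && funct3 < 8 && rd < 32 && opcode < 128);
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode;
}

constexpr uint32_t kOpcodeOp = 0x33;  // register-register ALU operations

// ADD: funct7 = 0x00, funct3 = 0x0 (base integer ISA)
inline uint32_t encode_add(uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return encode_r_type(0x00, rs2, rs1, 0x0, rd, kOpcodeOp);
}

// MUL: funct7 = 0x01, funct3 = 0x0 (M extension)
inline uint32_t encode_mul(uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return encode_r_type(0x01, rs2, rs1, 0x0, rd, kOpcodeOp);
}
```

With t2 = x7, t0 = x5, and t1 = x6, `encode_add(7, 5, 6)` gives 0x006283B3 and `encode_mul(7, 5, 6)` gives 0x026283B3. Only bit 25 changes, so the decoder does the same work for both and the cost difference comes entirely from the execute stage.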

Experiment 2: Remove the Barrier

Comment out the DMA barrier:

noc_async_read(src0_dram_noc_addr, src0_l1, sizeof(uint32_t));
noc_async_read(src1_dram_noc_addr, src1_l1, sizeof(uint32_t));
// noc_async_read_barrier();  ← COMMENT THIS OUT

What happens?

The kernel might read garbage from L1 (DMA not complete yet)
You may get wrong answers, and not necessarily on every run (race condition!)
This demonstrates why synchronization matters

Experiment 3: Measure DMA vs Compute Time

Modify the kernel to measure cycles:

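// Assumption: get_cycle_count() is a cycle-counter helper; the exact name depends on your tt-metal version
uint64_t start = get_cycle_count();
noc_async_read(src0_dram_noc_addr, src0_l1, sizeof(uint32_t));
noc_async_read(src1_dram_noc_addr, src1_l1, sizeof(uint32_t));  // both operands must arrive before the add
noc_async_read_barrier();
uint64_t dma_cycles = get_cycle_count() - start;

start = get_cycle_count();
(*out0) = (*dat0) + (*dat1);
uint64_t compute_cycles = get_cycle_count() - start;

DPRINT << "DMA: " << dma_cycles << " cycles, Compute: " << compute_cycles << " cycles\n";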

Typical result:

DMA: ~200 cycles (memory latency)
Compute: ~4 cycles (load, load, add, store)

Insight: Memory access is 50x slower than computation!

This is why near-memory compute matters - Module 2 explores this deeply.

Part 7: Discussion Questions

Question 1: Why RISC vs CISC?

RISC (Reduced Instruction Set Computing):

Simple instructions (add, load, store)
Fixed instruction size (32 bits)
Easy to pipeline and parallelize
Compiler does the optimization

CISC (Complex Instruction Set Computing):

Complex instructions (e.g., x86's ADD can take a memory operand, so one instruction loads, adds, and stores)
Variable instruction size (1-15 bytes on x86)
More work per instruction, but harder to pipeline

Question: Why did Tenstorrent choose RISC-V over x86 or ARM?

Answer: Simpler hardware, easier to replicate 880 times, better for parallel workloads

Question 2: What Makes a Good Instruction Set?

Consider these design tradeoffs:

More instructions = More silicon area per core
Complex instructions = Harder to pipeline = Lower clock frequency
Simple instructions = More instructions per program = More instruction memory

Question: Is there a "perfect" ISA?

Answer: No! It depends on the workload. RISC-V is great for parallel data processing (what Tenstorrent does), x86 is great for sequential legacy code (what your laptop does).

Question 3: How Does This Compare to Your CPU?

Your laptop's CPU has:

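Out-of-order execution - Instructions execute as soon as their inputs are ready, not strictly in program order
Branch prediction - Guesses which way an if will go
Speculative execution - Starts executing the predicted path before the branch outcome is known, and discards the work if the guess was wrong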
Cache coherence - Automatic synchronization between cores

BRISC has NONE of these.

Question: Is BRISC "worse" than your laptop's CPU?

Answer: No! It's different. BRISC is simple, predictable, and easy to replicate 880 times. Your laptop's CPU is complex and fast for single-threaded code, but each core costs far more area and power, so it is expensive to replicate in large numbers.

Different tools for different jobs.

Part 8: Connections to Other Systems

GPUs (NVIDIA, AMD)

GPUs have thousands of simple cores (like BRISC) running SIMT (Single Instruction, Multiple Threads):

Threads in a group (a warp or wavefront) execute the same instruction
Different threads process different data
Explicit memory hierarchy (global → shared → registers)
Explicit synchronization (__syncthreads())

Sound familiar? That's what we're learning here.

Cloud FPGAs (AWS F1, Azure NP-Series)

FPGAs let you design custom processors:

You could implement a RISC-V core in an FPGA
Full control over instruction set and memory hierarchy
But: Much more complex to program

Tenstorrent gives you FPGA-like control with CPU-like programming.

Embedded RISC-V (SiFive, ESP32-C3)

Other RISC-V chips exist, but typically:

1-4 cores (vs our 880)
No network-on-chip (single shared bus)
KB of memory (vs our 1.5 MB L1 per core)

Tenstorrent is RISC-V at datacenter scale.

Part 9: Key Takeaways

After completing this module, you should understand:

✅ Von Neumann Architecture - The fetch-decode-execute cycle is fundamental
✅ RISC-V ISA - Simple instructions executed quickly
✅ Memory Hierarchy - DRAM is slow, L1 SRAM is fast (Module 2 deepens this)
✅ Explicit Control - No OS, no abstractions, you control everything
✅ One Core Mastered - Now we can scale to 880 cores (Module 3)

What We Skipped (For Now)

Pipelining - How CPUs overlap instructions (comes up in Module 7)
Memory addressing modes - We only used simple loads/stores with a zero offset
The other 4 RISC-V cores - NCRISC, TRISC0/1/2 (Module 6 covers this)
Multi-core communication - How 880 cores talk to each other (Module 4)

We're building up systematically. Each module adds one concept.

Part 10: Next Steps

Recommended Experiments

Before moving to Module 2, try these:

Modify the integers: Change from 14+7 to 1000000+2000000
Try division: Replace + with / and see the RISC-V div instruction
Add a third integer: Extend the kernel to compute a + b + c
Measure cycle count: Use get_cycle_count() to profile execution

Preview of Module 2: Memory Hierarchy

Next, we'll explore why a DRAM access (~200 cycles) costs so much more than an L1 access (~1 cycle):

Cache locality (spatial and temporal)
Bandwidth vs latency tradeoffs
Near-memory compute advantages
How to structure data for performance

Teaser question: We transferred 4 bytes (one integer) from DRAM to L1. That took ~200 cycles. If we transferred 4 KB (1,024 integers), would it take roughly 200,000 cycles?

Spoiler: No! Bandwidth is different from latency. Module 2 explains why.
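A first-order way to see the difference is to charge the latency once per transfer and then a per-byte streaming cost. The function below is a rough sketch, not a measurement. `transfer_cycles` is a hypothetical helper, and `bytes_per_cycle` is a placeholder parameter you would replace with a figure measured on your own hardware.

```cpp
#include <cassert>
#include <cstdint>

// First-order model: pay the latency once, then stream at a fixed rate.
// bytes_per_cycle is a placeholder parameter, not a measured hardware figure.
inline uint64_t transfer_cycles(uint64_t bytes,
                                uint64_t latency_cycles,
                                uint64_t bytes_per_cycle) {
    assert(bytes_per_cycle > 0);
    const uint64_t streaming = (bytes + bytes_per_cycle - 1) / bytes_per_cycle;  // round up
    return latency_cycles + streaming;
}
```

Take the ~200-cycle latency from this module and an assumed, purely illustrative rate of 32 bytes per cycle. Then `transfer_cycles(4, 200, 32)` is 201 and `transfer_cycles(4096, 200, 32)` is 328. Whatever the real rate turns out to be, the fixed latency is paid once per transfer and not once per integer.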

Additional Resources

RISC-V Learning

RISC-V ISA Spec: https://riscv.org/technical/specifications/
RISC-V Reader Book: Free e-book, excellent introduction
RV32IM Reference Card: One-page summary of all instructions

Tenstorrent Resources

Metalium Guide: ~/tt-metal/METALIUM_GUIDE.md
Programming Examples: ~/tt-metal/tt_metal/programming_examples/
Tech Reports: ~/tt-metal/tech_reports/prog_examples/

Community

Tenstorrent Discord: https://discord.gg/tenstorrent
GitHub: https://github.com/tenstorrent/tt-metal

Summary: From Theory to Practice

We started with CS theory:

Von Neumann architecture (1945)
Fetch-decode-execute cycle (every computer)
Turing completeness (what makes a computer universal)

We connected to industry:

Why abstractions leak (Python hides this, but it's still happening)
How hardware acceleration works (GPUs do the same thing)
Performance debugging (understanding instructions = understanding performance)

We ran it on real hardware:

880 RISC-V cores (176 Tensix × 5 cores each)
BRISC processor (RV32IM instruction set)
Bare-metal programming (no OS, complete control)

Next module: We explore why memory is slow and what to do about it.

Welcome to CS fundamentals on real hardware. Let's continue! 🚀
