Exploring Tenstorrent as a RISC-V Assembly Programming Platform
Introduction: An Unconventional RISC-V Environment
When most people think of RISC-V programming, they imagine embedded development boards like SiFive's HiFive or ESP32-C3 microcontrollers. But Tenstorrent's Wormhole and Blackhole accelerator cards offer something far more exotic: hundreds of RISC-V cores networked together on a single chip, each with direct access to 1.5MB of local SRAM and connected via a high-performance Network-on-Chip (NoC).
This isn't your typical embedded RISC-V environment. Each Tensix core on a Tenstorrent processor contains five independent RISC-V CPUs working in concert - two for data movement, three for compute pipeline stages. Rather than being hidden behind abstraction layers, these processors are directly programmable, offering a unique platform for exploring RISC-V assembly programming, parallel computing, and near-memory compute architectures.
This guide explores Tenstorrent hardware from the perspective of a RISC-V programmer, revealing the low-level architecture and providing hands-on examples of programming these processors directly.
Part 1: Architecture Deep-Dive
The Tensix Core: Five RISC-V Processors Working Together
Each Tensix core is a complete compute unit containing:
┌─────────────────────────────────────────────────┐
│ TENSIX CORE │
├─────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ BRISC │ │ NCRISC │ │
│ │ (Data Move 0)│ │ (Data Move 1)│ │
│ │ RISCV_0 │ │ RISCV_1 │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ ┌──────────────┬──────────────┬──────────────┐│
│ │ TRISC0 │ TRISC1 │ TRISC2 ││
│ │ (Unpack) │ (Math) │ (Pack) ││
│ └──────────────┴──────────────┴──────────────┘│
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ 1.5 MB L1 SRAM │ │
│ │ (Shared Memory) │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────┐ │
│ │ Matrix Engine (FPU) + Vector (SFPU) │ │
│ └─────────────────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ NoC 0 │ │ NoC 1 │ │
│ │ Interface │ │ Interface │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────┘
The Five RISC-V Processors
1. BRISC (Base RISC) - RISCV_0 / Data Movement 0
- Purpose: Primary data movement processor
- Firmware: brisc.cc
- Typical tasks: Reading data from DRAM/other cores via NoC
- Memory regions: 4KB local memory, 6KB firmware space, 48KB kernel space
- Memory base: 0xFFB00000 (local); firmware begins at the end of the mailbox region
2. NCRISC (Network Core RISC) - RISCV_1 / Data Movement 1
- Purpose: Secondary data movement processor, network operations
- Firmware: ncrisc.cc
- Special feature: Dedicated IRAM at 0xFFC00000 (16 KB)
- Typical tasks: Writing data to DRAM/other cores via NoC
- Memory regions: 4KB local memory, 2KB firmware space, 16KB IRAM for kernels
3-5. TRISC0, TRISC1, TRISC2 (Tensor RISC) - Compute Pipeline
- TRISC0 (Unpack): Moves data from L1 SRAM into compute engine registers
- TRISC1 (Math): Issues instructions to FPU and SFPU compute engines
- TRISC2 (Pack): Writes results from compute engines back to L1 SRAM
- Firmware: trisc.cc (shared codebase)
- Memory regions: 2KB local memory each, 1.5KB firmware, 24KB kernel space
RISC-V ISA: RV32IM
All five processors implement the RV32IM instruction set:
- RV32I: Base integer instruction set (32-bit)
- M Extension: Integer multiplication and division
Key characteristics:
- No hardware threads - Single-threaded execution per core
- No caches - Explicit DMA operations for memory access
- No FPU in RISC-V cores - Floating point handled by dedicated hardware engines
- Bare-metal execution - No OS, no virtual memory, direct hardware access
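Since there is no F or D extension, any floating-point arithmetic in kernel C++ is lowered by the compiler to soft-float library calls rather than FP instructions. A small illustration of what an rv32im compiler does with each kind of math (function names here are just examples):

// With -march=rv32im there is no F extension: this compiles to soft-float
// library calls (e.g. libgcc's __mulsf3/__addsf3) executed as integer code.
float scale_f(float x) { return x * 0.5f + 1.0f; }

// Integer math maps directly onto single instructions:
// mul (M extension) and addi (base RV32I).
uint32_t scale_i(uint32_t x) { return x * 3u + 1u; }

This is why heavy floating-point work belongs on the FPU/SFPU engines, with the RISC-V cores acting as orchestrators.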
Memory Architecture: A RISC-V Programmer's View
L1 SRAM (1.5MB per Tensix)
Base Address: 0x00000000
Size: 1464 KB (~1.5 MB)
Access: Shared across all 5 RISC-V cores in the Tensix
Also accessible by other Tensix cores via NoC
Purpose:
- Circular buffers for inter-kernel communication
- Temporary data storage
- Code execution space for kernels
Local Memory (Per-Processor Private Memory)
BRISC: 0xFFB00000 - 0xFFB00FFF (4 KB)
NCRISC: 0xFFB01000 - 0xFFB01FFF (4 KB)
TRISC: 0xFFB02000 - 0xFFB027FF (2 KB each)
Purpose: Stack, local variables, processor-specific data
NCRISC IRAM (Instruction RAM)
Base Address: 0xFFC00000
Size: 16 KB
Purpose: Fast instruction execution for NCRISC kernels
(Wormhole architecture feature)
DRAM
Size: 1 GB per chip (distributed across DRAM controllers)
Access: Via NoC DMA operations only
Not directly addressable from RISC-V cores
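Even a single 4-byte load from DRAM is therefore a NoC DMA into L1 followed by an ordinary lw. A sketch using the APIs introduced in Parts 3 and 4, with a hypothetical DRAM controller coordinate and staging address:

// Sketch: one word from DRAM = NoC read into L1, then a plain load.
// (dram_x, dram_y), bank_offset, and local_l1_addr are illustrative.
constexpr uint32_t dram_x = 0, dram_y = 0;  // assumed DRAM controller location
constexpr uint32_t bank_offset   = 0x0;     // offset within the DRAM bank
constexpr uint32_t local_l1_addr = 0x2000;  // assumed free L1 staging address

uint64_t dram_noc_addr = get_noc_addr(dram_x, dram_y, bank_offset);
noc_async_read(dram_noc_addr, local_l1_addr, sizeof(uint32_t));
noc_async_read_barrier();
uint32_t value = *(volatile uint32_t*)local_l1_addr;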
The Mailbox: Inter-Processor Communication
Located at MEM_MAILBOX_BASE (offset 16 in L1):
Address: 0x00000010 - 0x000031BF
Size: 12,720 bytes
Contains:
- Device messages (dev_msgs_t structure)
- Runtime arguments passed from host
- Synchronization flags
- NCRISC halt/resume stack pointer (offset +4)
The mailbox is the primary mechanism for:
- Host-to-device communication
- Passing kernel arguments at runtime
- Inter-processor synchronization
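Conceptually, reading a runtime argument (the get_arg_val used throughout Part 3) is just an indexed load from this region. A sketch with a hypothetical base constant; the real offset comes from the device memory-map headers:

// Hypothetical base address; the actual value is defined in dev_mem_map.h.
constexpr uint32_t RUNTIME_ARGS_BASE = 0x00000400;

// Conceptual sketch of get_arg_val (not the real tt-metal implementation):
// the host writes the arguments into a known L1 region, and the kernel
// simply indexes into it.
template <typename T>
inline T get_arg_val_sketch(uint32_t index) {
    return reinterpret_cast<volatile T*>(RUNTIME_ARGS_BASE)[index];
}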
Part 2: The Toolchain and Build System
Compilation Pipeline
Tenstorrent uses a standard RISC-V GCC toolchain with custom linker scripts:
┌─────────────────┐
│ Kernel Code │ (C++ with device APIs)
│ example.cpp │
└────────┬────────┘
│
▼
┌─────────────────┐
│ riscv32-gcc │ (Cross compiler)
│ -march=rv32im │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Linker │ (Custom linker scripts)
│ main.ld │ - Separate sections per processor
│ │ - Firmware vs. kernel regions
└────────┬────────┘
│
▼
┌─────────────────┐
│ ELF Binary │ (elf32-littleriscv)
│ (per core) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ tt-metal │ (Runtime loads onto device)
│ Host API │
└─────────────────┘
Linker Script Structure
The main linker script (tt_metal/hw/toolchain/main.ld) separates memory regions per processor:
OUTPUT_FORMAT("elf32-littleriscv", "elf32-littleriscv", "elf32-littleriscv")
OUTPUT_ARCH(riscv)
/* Conditional compilation per processor */
#if defined(COMPILE_FOR_BRISC)
#define TEXT_START MEM_BRISC_FIRMWARE_BASE
#define TEXT_SIZE MEM_BRISC_FIRMWARE_SIZE
#define DATA_START MEM_LOCAL_BASE
#define DATA_SIZE MEM_BRISC_LOCAL_SIZE
#elif defined(COMPILE_FOR_NCRISC)
#define TEXT_START MEM_NCRISC_KERNEL_BASE /* IRAM! */
#define TEXT_SIZE MEM_NCRISC_KERNEL_SIZE
/* ... */
#endif
SECTIONS {
.text TEXT_START : {
*(.start) /* Assembly entry point */
*(.text .text.*) /* Code */
}
.data DATA_START : {
*(.data .data.*) /* Initialized data */
}
.bss : {
*(.bss .bss.*) /* Uninitialized data */
}
}
Assembly Startup: tmu-crt0.S
Every RISC-V program starts with _start in tmu-crt0.S:
.section .start,"ax",@progbits
.global _start
.type _start, @function
_start:
/* Initialize global pointer (gp register) */
.option push
.option norelax
lui gp, %hi(__global_pointer$)
addi gp, gp, %lo(__global_pointer$)
.option pop
/* Set stack pointer */
lui sp, %hi(__stack_top - 16)
addi sp, sp, %lo(__stack_top - 16)
/* Pass Tensix coordinates as argv[0] */
addi a0, sp, 8
sw a0, 0(sp) /* argv[0] */
sw zero, 4(sp) /* argv[1] = NULL */
sw s1, 8(sp) /* Coordinates in s1 */
sw zero, 12(sp)
li a0, 1 /* argc = 1 */
mv a1, sp /* argv */
mv a2, zero /* env = NULL */
/* Call main, then exit */
call main
tail exit
Key insights:
- Global pointer (gp): Used for efficient access to the small-data section
- Stack setup: Each processor has its own stack in local memory
- Tensix coordinates: Passed via the s1 register (set by hardware)
- No OS: Direct jump to main(), no libc initialization
Part 3: Hands-On Example - Adding Two Integers in RISC-V
Let's walk through the canonical example from tt-metal: add_2_integers_in_riscv.
High-Level Flow
┌──────────────┐
│ Host │
│ (C++ API) │
└──────┬───────┘
│ 1. Create buffers in DRAM
│ 2. Upload integers (14, 7)
│ 3. Launch kernel on BRISC
▼
┌──────────────────────────────────┐
│ Tensix Core (0,0) │
│ │
│ ┌────────────────────────────┐ │
│ │ BRISC Kernel │ │
│ │ (Data Movement Core) │ │
│ │ │ │
│ │ 1. Read 14 from DRAM → L1│ │
│ │ 2. Read 7 from DRAM → L1 │ │
│ │ 3. Add: 14 + 7 = 21 │ │ ← RISC-V addition!
│ │ 4. Write 21 to DRAM │ │
│ └────────────────────────────┘ │
└──────────────────────────────────┘
│
▼
┌──────────────┐
│ Host │
│ Read result │
│ (21) │
└──────────────┘
The Kernel Code (Device Side)
File: tt_metal/programming_examples/add_2_integers_in_riscv/kernels/reader_writer_add_in_riscv.cpp
void kernel_main() {
// Get runtime arguments (DRAM and L1 addresses)
uint32_t src0_dram = get_arg_val<uint32_t>(0);
uint32_t src1_dram = get_arg_val<uint32_t>(1);
uint32_t dst_dram = get_arg_val<uint32_t>(2);
uint32_t src0_l1 = get_arg_val<uint32_t>(3);
uint32_t src1_l1 = get_arg_val<uint32_t>(4);
uint32_t dst_l1 = get_arg_val<uint32_t>(5);
// Create address generators (for DRAM buffers)
InterleavedAddrGen<true> src0 = {
.bank_base_address = src0_dram,
.page_size = sizeof(uint32_t)
};
InterleavedAddrGen<true> src1 = {
.bank_base_address = src1_dram,
.page_size = sizeof(uint32_t)
};
InterleavedAddrGen<true> dst = {
.bank_base_address = dst_dram,
.page_size = sizeof(uint32_t)
};
// ═══════════════════════════════════════════════════════
// STEP 1: DMA from DRAM to L1 SRAM (via NoC)
// ═══════════════════════════════════════════════════════
uint64_t src0_dram_noc_addr = get_noc_addr(0, src0);
uint64_t src1_dram_noc_addr = get_noc_addr(0, src1);
noc_async_read(src0_dram_noc_addr, src0_l1, sizeof(uint32_t));
noc_async_read(src1_dram_noc_addr, src1_l1, sizeof(uint32_t));
noc_async_read_barrier(); // Wait for DMA to complete
// ═══════════════════════════════════════════════════════
// STEP 2: THE RISC-V ADDITION (Running on BRISC)
// ═══════════════════════════════════════════════════════
uint32_t* dat0 = (uint32_t*)src0_l1; // Pointer to L1 SRAM
uint32_t* dat1 = (uint32_t*)src1_l1;
uint32_t* out0 = (uint32_t*)dst_l1;
// This is compiled to RISC-V add instruction!
(*out0) = (*dat0) + (*dat1);
// Optional: Print (requires TT_METAL_DPRINT_CORES=0,0)
DPRINT << "Adding: " << *dat0 << " + " << *dat1 << "\n";
// ═══════════════════════════════════════════════════════
// STEP 3: DMA from L1 back to DRAM
// ═══════════════════════════════════════════════════════
uint64_t dst_dram_noc_addr = get_noc_addr(0, dst);
noc_async_write(dst_l1, dst_dram_noc_addr, sizeof(uint32_t));
noc_async_write_barrier(); // Wait for write to complete
}
What Really Happens in Assembly
When you compile this kernel, the addition becomes RISC-V assembly:
# Load from L1 SRAM (dat0 and dat1 are pointers in L1)
lw t0, 0(a0) # Load *dat0 into t0
lw t1, 0(a1) # Load *dat1 into t1
# RISC-V addition
add t2, t0, t1 # t2 = t0 + t1
# Store result back to L1 SRAM
sw t2, 0(a2) # Store t2 into *out0
Key insight: While the C++ API provides noc_async_read/write for DMA operations, the actual arithmetic happens in plain RISC-V instructions executing on the BRISC processor.
Host Code (Orchestration)
File: tt_metal/programming_examples/add_2_integers_in_riscv/add_2_integers_in_riscv.cpp
int main() {
// Create 1x1 mesh (single device)
auto mesh_device = distributed::MeshDevice::create_unit_mesh(0);
distributed::MeshCommandQueue& cq = mesh_device->mesh_command_queue();
// Create DRAM and L1 buffers
constexpr uint32_t buffer_size = sizeof(uint32_t);
auto src0_dram_buffer = distributed::MeshBuffer::create(
buffer_config, dram_config, mesh_device.get());
auto src1_dram_buffer = /* ... */;
auto dst_dram_buffer = /* ... */;
auto src0_l1_buffer = distributed::MeshBuffer::create(
buffer_config, l1_config, mesh_device.get());
auto src1_l1_buffer = /* ... */;
auto dst_l1_buffer = /* ... */;
// Upload integers to DRAM
std::vector<uint32_t> src0_vec = {14};
std::vector<uint32_t> src1_vec = {7};
EnqueueWriteMeshBuffer(cq, src0_dram_buffer, src0_vec, false);
EnqueueWriteMeshBuffer(cq, src1_dram_buffer, src1_vec, false);
// Create kernel that runs on BRISC (Data Movement 0)
Program program = CreateProgram();
KernelHandle kernel_id = CreateKernel(
program,
"add_2_integers_in_riscv/kernels/reader_writer_add_in_riscv.cpp",
CoreCoord{0, 0}, // Tensix at (0,0)
DataMovementConfig{
.processor = DataMovementProcessor::RISCV_0, // BRISC
.noc = NOC::RISCV_0_default
});
// Pass addresses as runtime arguments
SetRuntimeArgs(program, kernel_id, CoreCoord{0, 0}, {
src0_dram_buffer->address(),
src1_dram_buffer->address(),
dst_dram_buffer->address(),
src0_l1_buffer->address(),
src1_l1_buffer->address(),
dst_l1_buffer->address(),
});
// Execute!
distributed::MeshWorkload workload;
workload.add_program(device_range, std::move(program));
distributed::EnqueueMeshWorkload(cq, workload, false);
// Read result
std::vector<uint32_t> result_vec;
distributed::EnqueueReadMeshBuffer(cq, result_vec, dst_dram_buffer, true);
std::cout << "Result: " << result_vec[0] << std::endl; // 21
mesh_device->close();
}
Part 4: The NoC - Network-on-Chip Architecture
What is the NoC?
The Network-on-Chip (NoC) is a 2D mesh interconnect that connects:
- All Tensix cores
- DRAM controllers
- PCIe interfaces
- Ethernet cores (for multi-chip)
Wormhole NoC Grid (Example):
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ D │ T │ T │ T │ T │ T │ T │ T │ T │ T │ T │ D │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ D │ T │ T │ T │ T │ T │ T │ T │ T │ T │ T │ D │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ D │ T │ T │ T │ T │ T │ T │ T │ T │ T │ T │ D │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ E │ T │ T │ T │ T │ P │ A │ T │ T │ T │ T │ E │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
Legend:
T = Tensix core (each with 5 RISC-V CPUs)
D = DRAM controller
E = Ethernet (for multi-chip)
P = PCIe
A = ARC (management processor)
NoC Programming Model
From a RISC-V programmer's perspective, NoC operations are asynchronous DMA transactions:
// Read from remote location (DRAM or another Tensix's L1)
uint64_t remote_addr = get_noc_addr(x, y, local_offset);
noc_async_read(remote_addr, local_l1_addr, size);
noc_async_read_barrier(); // Wait for completion
// Write to remote location
noc_async_write(local_l1_addr, remote_addr, size);
noc_async_write_barrier();
NoC address encoding:
63 48 47 40 39 0
┌───────────┬─────────┬──────────────────┐
│ NoC Y │ NoC X │ Local Address │
└───────────┴─────────┴──────────────────┘
Helper function:
uint64_t get_noc_addr(uint32_t x, uint32_t y, uint32_t addr) {
return ((uint64_t)y << 48) | ((uint64_t)x << 40) | addr;
}
Example: Reading from Another Tensix Core
// Read from L1 SRAM of Tensix at (3, 4)
constexpr uint32_t remote_x = 3;
constexpr uint32_t remote_y = 4;
constexpr uint32_t remote_l1_addr = 0x1000;
constexpr uint32_t local_l1_addr = 0x2000;
uint64_t noc_addr = get_noc_addr(remote_x, remote_y, remote_l1_addr);
noc_async_read(noc_addr, local_l1_addr, 1024); // Read 1KB
noc_async_read_barrier();
// Now data is in local L1 at 0x2000
uint32_t* data = (uint32_t*)local_l1_addr;
Part 5: Writing Pure Assembly Kernels (Advanced)
While most kernels are written in C++, you can write pure RISC-V assembly.
Example: Assembly Addition Kernel
File: my_asm_add.S
.section .text
.globl kernel_main
.type kernel_main, @function
kernel_main:
    # Prologue: save return address and the callee-saved registers we use
    # (s0-s2 belong to our caller under the RISC-V calling convention)
    addi sp, sp, -16
    sw ra, 12(sp)
    sw s0, 8(sp)
    sw s1, 4(sp)
    sw s2, 0(sp)
    # Get runtime arguments from the mailbox.
    # (Assumes a plain C-linkage wrapper named get_arg_val; the C++
    # template itself has no un-mangled symbol callable from assembly.)
    li a0, 0              # arg index 0
    call get_arg_val      # returns src0_l1 address in a0
    mv s0, a0             # save in s0
    li a0, 1              # arg index 1
    call get_arg_val      # returns src1_l1 address
    mv s1, a0             # save in s1
    li a0, 2              # arg index 2
    call get_arg_val      # returns dst_l1 address
    mv s2, a0             # save in s2
    # Load operands from L1 SRAM
    lw t0, 0(s0)          # Load *src0_l1
    lw t1, 0(s1)          # Load *src1_l1
    # THE ADDITION!
    add t2, t0, t1
    # Store result to L1 SRAM
    sw t2, 0(s2)          # Store to *dst_l1
    # Epilogue: restore saved registers and return
    lw s2, 0(sp)
    lw s1, 4(sp)
    lw s0, 8(sp)
    lw ra, 12(sp)
    addi sp, sp, 16
    ret
.size kernel_main, .-kernel_main
Inline Assembly in C++ Kernels
You can also embed assembly directly:
void kernel_main() {
uint32_t* src0 = (uint32_t*)get_arg_val<uint32_t>(0);
uint32_t* src1 = (uint32_t*)get_arg_val<uint32_t>(1);
uint32_t* dst = (uint32_t*)get_arg_val<uint32_t>(2);
uint32_t result;
    // Inline assembly for addition: load through the pointer operands,
    // add, and leave the sum directly in the output register
    asm volatile (
        "lw t0, 0(%1)\n"          // Load *src0
        "lw t1, 0(%2)\n"          // Load *src1
        "add %0, t0, t1\n"        // result = t0 + t1
        : "=r" (result)           // Output: the sum
        : "r" (src0), "r" (src1)  // Inputs: operand pointers
        : "t0", "t1"              // Clobbered temporaries
    );
*dst = result;
}
Part 6: Parallel RISC-V Programming
Multi-Core Execution
Launch the same kernel on multiple Tensix cores:
// Run on 4x4 grid of cores
CoreRange core_range = {{0, 0}, {3, 3}}; // (0,0) to (3,3)
KernelHandle kernel_id = CreateKernel(
program, "my_kernel.cpp", core_range,
DataMovementConfig{.processor = DataMovementProcessor::RISCV_0});
// Set DIFFERENT runtime arguments per core
for (uint32_t x = 0; x < 4; x++) {
for (uint32_t y = 0; y < 4; y++) {
CoreCoord core{x, y};
// Calculate which data this core processes
uint32_t data_offset = (y * 4 + x) * chunk_size;
SetRuntimeArgs(program, kernel_id, core, {
input_buffer->address() + data_offset,
output_buffer->address() + data_offset,
chunk_size
});
}
}
Each BRISC processor executes the same code but with different arguments!
Getting Core Coordinates in Kernel
void kernel_main() {
// Built-in variables (set by firmware)
uint32_t my_x = my_logical_x_;
uint32_t my_y = my_logical_y_;
// Compute unique ID
    uint32_t core_id = my_y * grid_width + my_x; // grid_width: assumed to arrive as a compile-time define or runtime arg
// Process data based on core ID
uint32_t offset = core_id * CHUNK_SIZE;
// ...
}
Inter-Core Communication via NoC
// Core (0,0) sends data to Core (1,0)
void kernel_main() {
if (my_logical_x_ == 0 && my_logical_y_ == 0) {
        // Sender core: stage the payload in L1 (the NoC engine reads from
        // L1, not from the small local memory that holds the stack)
        uint32_t* data = (uint32_t*)0x2000; // assumed free L1 address
        // ... fill data ...
        uint64_t dest_addr = get_noc_addr(1, 0, 0x1000);
        noc_async_write((uint32_t)data, dest_addr, 256 * sizeof(uint32_t));
        noc_async_write_barrier();
} else if (my_logical_x_ == 1 && my_logical_y_ == 0) {
// Receiver core
uint32_t* received = (uint32_t*)0x1000;
// ... wait for data arrival ...
// ... process received data ...
}
}
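The receiver's "wait for data arrival" step needs an explicit signal, because these cores have no interrupts. A minimal polling sketch, assuming the payload lands at 0x1000 and a flag word at 0x1400 is otherwise unused (tt-metal also ships semaphore helpers such as noc_semaphore_set/noc_semaphore_wait for this pattern):

// Receiver-side handshake sketch (assumed addresses, not a tt-metal API).
// The sender writes its payload to 0x1000, barriers, then writes a nonzero
// word to 0x1400 with a second noc_async_write.
void wait_for_payload() {
    volatile uint32_t* ready_flag = (volatile uint32_t*)0x1400;
    while (*ready_flag == 0) {
        // busy-wait: polling is the only option without interrupts
    }
    uint32_t* received = (uint32_t*)0x1000;
    // ... process received data ...
}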
Part 7: Debugging and Profiling
Debug Printing from RISC-V Cores
Enable debug printing:
export TT_METAL_DPRINT_CORES=0,0 # Enable for Tensix (0,0)
In kernel:
#include "debug/dprint.h"
void kernel_main() {
DPRINT << "Hello from BRISC at ("
<< my_logical_x_ << "," << my_logical_y_ << ")\n";
uint32_t value = 42;
DPRINT << "Value: " << value << "\n";
}
Output appears on host stdout.
Profiling RISC-V Execution
export TT_METAL_DEVICE_PROFILER=1
This enables cycle-accurate profiling of:
- Kernel execution time per core
- NoC transaction latency
- Time spent in each RISC-V processor
Register Inspection (Advanced)
The firmware exposes some processor state via the mailbox, which the host can read back. The snippet below is illustrative pseudocode; the exact host-side calls vary by tt-metal version:
// Read BRISC instruction pointer (example)
auto mailbox_addr = device->get_mailbox_address(core);
auto pc_value = device->read_l1(mailbox_addr + PC_OFFSET, sizeof(uint32_t));
Part 8: Comparison to Other RISC-V Platforms
Tenstorrent vs. Traditional RISC-V Boards
| Feature | Tenstorrent Wormhole | SiFive HiFive | ESP32-C3 |
|---|---|---|---|
| RISC-V Cores | 880 (5 per Tensix × 176) | 1-5 cores | 1 core |
| ISA | RV32IM | RV64GC | RV32IMC |
| Clock Speed | ~1 GHz | ~1.5 GHz | 160 MHz |
| Local SRAM | 1.5 MB per Tensix (shared by its 5 cores) | 32 KB L1 cache | 400 KB |
| Interconnect | 2D NoC mesh | AXI bus | Single bus |
| Programming | C++ device API | Bare-metal C/ASM | FreeRTOS/bare-metal |
| Use Case | AI accelerator | Linux SBC | IoT embedded |
| Unique Feature | Hundreds of cores + dedicated matrix/vector engines | Standard Linux workstation | WiFi/BLE integrated |
Key Differences
Advantages of Tenstorrent for RISC-V exploration:
- ✅ Massive parallelism: 880 RISC-V cores on a single chip
- ✅ Near-memory compute: 1.5MB L1 per Tensix, no cache hierarchy
- ✅ High-bandwidth interconnect: NoC enables core-to-core communication at 100+ GB/s aggregate
- ✅ Explicit control: No OS, no hidden behavior, deterministic execution
Challenges:
- ❌ Not general-purpose: Designed for AI workloads, not typical embedded tasks
- ❌ Limited peripherals: No GPIO, UART, SPI in traditional sense
- ❌ Learning curve: Requires understanding NoC, mesh architecture, and tt-metal API
Part 9: Real-World Examples in tt-metal
Matrix Multiplication (SPMD Parallelism)
Found in tt_metal/programming_examples/matmul/:
- Multiple Tensix cores each process a tile of the matrix
- Data movement cores (BRISC/NCRISC) load tiles from DRAM
- Compute cores (TRISC) orchestrate FPU operations
- Results written back via DMA
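A hedged sketch of the SPMD idea behind the data-movement kernels (names and argument layout here are illustrative, not the real example's):

void kernel_main() {
    // Each core learns its slice of the output from runtime args
    uint32_t start_tile = get_arg_val<uint32_t>(0); // first output tile owned
    uint32_t num_tiles  = get_arg_val<uint32_t>(1); // tiles this core produces

    for (uint32_t t = start_tile; t < start_tile + num_tiles; ++t) {
        // 1. noc_async_read the A-row and B-column tiles for output tile t
        // 2. hand them to the TRISCs through circular buffers (Part 11)
        // 3. noc_async_write the finished tile back to DRAM
    }
}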
Multicast Communication
Found in tech_reports/prog_examples/multicast/:
- One Tensix broadcasts data to multiple receivers simultaneously
- Uses NoC multicast addressing
- Efficient for distributing weights in neural networks
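On the kernel side this uses the multicast variants of the write API; a sketch with simplified signatures (check dataflow_api.h for the exact forms, and treat the addresses and receiver count as illustrative):

// Broadcast one L1 buffer to every core in the rectangle (1,0)..(4,0).
uint32_t local_l1_addr = 0x2000;   // source data staged in our L1
uint32_t size_bytes    = 1024;     // payload size
uint64_t mcast_addr = get_noc_multicast_addr(1, 0, 4, 0, 0x1000);
noc_async_write_multicast(local_l1_addr, mcast_addr, size_bytes, 4 /* receivers */);
noc_async_write_barrier();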
Flash Attention
Found in tech_reports/FlashAttention/:
- Tiled attention mechanism
- Each Tensix processes a query/key/value tile
- Heavy use of L1 SRAM to avoid DRAM bottlenecks
- RISC-V cores orchestrate data movement between tiles
Part 10: Getting Started - Build and Run
Prerequisites
# Clone tt-metal
git clone https://github.com/tenstorrent/tt-metal.git
cd tt-metal
# Install dependencies
./install_dependencies.sh
# Build with programming examples
./build_metal.sh --build-programming-examples
Run the RISC-V Addition Example
export TT_METAL_HOME=$(pwd)
export TT_METAL_DPRINT_CORES=0,0 # Enable debug output
./build/programming_examples/add_2_integers_in_riscv
Expected output:
Adding integers: 14 + 7
Success: Result is 21
Exploring the Firmware
# View BRISC firmware source
cat tt_metal/hw/firmware/src/tt-1xx/brisc.cc
# View assembly startup
cat tt_metal/hw/toolchain/tmu-crt0.S
# View linker script
cat tt_metal/hw/toolchain/main.ld
# View memory map
cat tt_metal/hw/inc/tt-1xx/wormhole/dev_mem_map.h
Writing Your Own Kernel
- Create my_kernel.cpp in a new directory
- Use the device API: get_arg_val, noc_async_read/write, etc.
- Compile via the CreateKernel API
- Launch from host with SetRuntimeArgs and EnqueueProgram
Example skeleton:
// my_kernel.cpp
#include "dataflow_api.h"
void kernel_main() {
// Get arguments
uint32_t arg0 = get_arg_val<uint32_t>(0);
// Your RISC-V code here!
uint32_t result = arg0 * 2;
// Write to L1
uint32_t* output = (uint32_t*)0x1000;
*output = result;
}
Part 11: Advanced Topics
DMA Optimization
Maximize NoC bandwidth:
// Bad: Sequential reads (latency adds up)
for (int i = 0; i < 1000; i++) {
noc_async_read(addr + i * 32, local + i * 32, 32);
noc_async_read_barrier(); // DON'T DO THIS IN LOOP!
}
// Good: Batch reads, single barrier
for (int i = 0; i < 1000; i++) {
noc_async_read(addr + i * 32, local + i * 32, 32);
}
noc_async_read_barrier(); // Wait once at the end
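Another common refinement is double buffering: issue the read for chunk i+1 before processing chunk i, so the NoC transfer overlaps compute. A sketch assuming dataflow_api.h is included; process() and the L1 staging addresses are hypothetical:

// Hypothetical per-chunk work; stands in for whatever the kernel does.
void process(uint32_t* words, uint32_t n);

void read_chunks(uint64_t src_base, int num_chunks) {
    constexpr uint32_t CHUNK = 1024;                     // bytes per chunk
    const uint32_t buf[2] = {0x10000, 0x10000 + CHUNK};  // assumed free L1

    noc_async_read(src_base, buf[0], CHUNK);             // prefetch chunk 0
    for (int i = 0; i < num_chunks; i++) {
        noc_async_read_barrier();                        // chunk i has landed
        if (i + 1 < num_chunks) {
            // issue the next read before touching chunk i, so the transfer
            // runs while this RISC-V core does the processing
            noc_async_read(src_base + (uint64_t)(i + 1) * CHUNK,
                           buf[(i + 1) & 1], CHUNK);
        }
        process((uint32_t*)buf[i & 1], CHUNK / sizeof(uint32_t));
    }
}

Note the barrier comes before the next read is issued; since noc_async_read_barrier() waits for all outstanding reads, issuing the prefetch first would serialize the two transfers.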
Circular Buffers (Advanced Inter-Kernel Communication)
Used for producer-consumer patterns between BRISC and TRISC:
// In reader kernel (BRISC)
constexpr uint32_t cb_id = tt::CBIndex::c_0;
cb_reserve_back(cb_id, 1); // Reserve space
uint32_t write_ptr = get_write_ptr(cb_id);
noc_async_read(src, write_ptr, tile_size);
noc_async_read_barrier();
cb_push_back(cb_id, 1); // Signal data ready
// In compute kernel (TRISC)
cb_wait_front(cb_id, 1); // Wait for data
uint32_t read_ptr = get_read_ptr(cb_id);
// ... process data ...
cb_pop_front(cb_id, 1); // Release buffer
Custom Firmware (Experimental)
You can modify the base firmware (e.g., brisc.cc) to change boot behavior, but this requires rebuilding the entire firmware image. Not recommended for most users.
Part 12: Limitations and Gotchas
What You CAN'T Do
- No dynamic memory allocation: No malloc(), new, etc. All buffers must be pre-allocated.
- No standard library: No printf, fopen, etc. Use device APIs instead.
- No interrupts: Polling-based synchronization only.
- No virtual memory: All addresses are physical.
- No floating-point in RISC-V cores: Use the FPU/SFPU engines via TRISC instead.
Common Pitfalls
1. Forgetting barriers:
noc_async_read(src, dst, size);
// BUG: Data might not be ready yet!
uint32_t* data = (uint32_t*)dst;
uint32_t value = data[0]; // May read garbage!
// FIX: Add barrier
noc_async_read(src, dst, size);
noc_async_read_barrier(); // Wait!
uint32_t* data = (uint32_t*)dst;
uint32_t value = data[0]; // Safe
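A related subtlety: even with the barrier, the compiler has no idea the DMA engine modified L1 behind its back, so it may reuse a value it loaded before the transfer. A conservative sketch using volatile to force a fresh load:

noc_async_read(src, dst, size);
noc_async_read_barrier();
// volatile guarantees an actual load after the barrier instead of letting
// the compiler reuse a register value cached from an earlier read
volatile uint32_t* data = (volatile uint32_t*)dst;
uint32_t value = data[0];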
2. Incorrect NoC addressing:
// BUG: Forgot to encode X/Y coordinates
uint64_t addr = 0x1000; // Missing NoC coordinates!
noc_async_read(addr, local, size); // Will fail!
// FIX: Use get_noc_addr
uint64_t addr = get_noc_addr(x, y, 0x1000);
noc_async_read(addr, local, size); // Correct
3. Stack overflow: Stack space is tiny (as little as 256 bytes, since stacks live in the small per-processor local memories). Avoid large local arrays:
// BAD
void kernel_main() {
uint32_t big_array[1000]; // 4KB - WILL OVERFLOW STACK!
// ...
}
// GOOD
void kernel_main() {
uint32_t* big_array = (uint32_t*)0x10000; // Use L1 instead
// ...
}
Conclusion: A Unique RISC-V Playground
Tenstorrent's Wormhole and Blackhole cards represent a rare opportunity to explore RISC-V programming at scale. Unlike traditional embedded boards with a handful of cores, these accelerators pack hundreds of RISC-V processors on a single chip, all connected via a high-performance mesh network and backed by massive on-chip SRAM.
What makes this platform special:
- Bare-metal access: No OS, no hidden behavior, direct hardware control
- Massive parallelism: 880 RISC-V cores working in concert
- Near-memory compute: 1.5MB L1 per Tensix eliminates memory bottlenecks
- Explicit communication: NoC programming teaches distributed systems concepts
- Production hardware: Not a research prototype - real AI accelerators in the field
Who should explore this:
- RISC-V enthusiasts: Want to program RV32IM at scale
- Parallel programming students: Learn distributed computing with real hardware
- Embedded developers: Understand bare-metal programming without an OS
- Computer architects: Study NoC, near-memory compute, and tiled architectures
- AI researchers: Optimize kernels at the lowest level
Next Steps:
- Build tt-metal and run add_2_integers_in_riscv
- Study the programming examples in tt_metal/programming_examples/
- Modify existing kernels to experiment with RISC-V assembly
- Write multi-core parallel algorithms using the NoC
- Profile your kernels and optimize for the architecture
The path from simple addition to complex AI workloads is paved with RISC-V instructions - hundreds of thousands of them, executing in parallel across the chip. This is RISC-V programming at a scale few other platforms can offer.
Welcome to the Tenstorrent RISC-V ecosystem. 880 cores are waiting for your code.
Appendix A: Quick Reference
Memory Map (Wormhole)
0x00000000 - 0x0016DFFF L1 SRAM (1464 KB)
0xFFB00000 - 0xFFB00FFF BRISC Local (4 KB)
0xFFB01000 - 0xFFB01FFF NCRISC Local (4 KB)
0xFFB02000 - 0xFFB027FF TRISC0 Local (2 KB)
0xFFB02800 - 0xFFB02FFF TRISC1 Local (2 KB)
0xFFB03000 - 0xFFB037FF TRISC2 Local (2 KB)
0xFFC00000 - 0xFFC03FFF NCRISC IRAM (16 KB)
Common Device API Functions
// Runtime arguments
uint32_t get_arg_val<T>(uint32_t index);
// NoC operations
uint64_t get_noc_addr(uint32_t x, uint32_t y, uint32_t addr);
void noc_async_read(uint64_t src_addr, uint32_t dst_addr, uint32_t size);
void noc_async_write(uint32_t src_addr, uint64_t dst_addr, uint32_t size);
void noc_async_read_barrier();
void noc_async_write_barrier();
// Circular buffers
void cb_reserve_back(uint32_t cb_id, uint32_t num_tiles);
void cb_push_back(uint32_t cb_id, uint32_t num_tiles);
void cb_wait_front(uint32_t cb_id, uint32_t num_tiles);
void cb_pop_front(uint32_t cb_id, uint32_t num_tiles);
uint32_t get_write_ptr(uint32_t cb_id);
uint32_t get_read_ptr(uint32_t cb_id);
// Core info
extern uint8_t my_logical_x_;
extern uint8_t my_logical_y_;
// Debug
DPRINT << "message" << value << "\n";
Build Commands
# Build tt-metal with examples
./build_metal.sh --build-programming-examples
# Clean rebuild
./build_metal.sh --clean
# Enable ccache for faster rebuilds
./build_metal.sh --enable-ccache
Environment Variables
export TT_METAL_HOME=/path/to/tt-metal
export TT_METAL_DPRINT_CORES=0,0 # Enable debug prints
export TT_METAL_DEVICE_PROFILER=1 # Enable profiling
export MESH_DEVICE=N150 # Hardware target
Appendix B: Resources
Official Documentation:
- tt-metal GitHub: https://github.com/tenstorrent/tt-metal
- Metalium Guide: tt-metal/METALIUM_GUIDE.md
- Programming Examples: tt-metal/tt_metal/programming_examples/
RISC-V Resources:
- RISC-V ISA Spec: https://riscv.org/technical/specifications/
- RV32IM Reference: https://riscv.org/wp-content/uploads/2017/05/riscv-spec-v2.2.pdf
Community:
- Tenstorrent Discord: https://discord.gg/tenstorrent
- GitHub Issues: https://github.com/tenstorrent/tt-metal/issues
Document Version: 1.0
Last Updated: 2025-12-16
Target Hardware: Wormhole (N150/N300), Blackhole
tt-metal Version: Latest main branch