
Module 4: Networks and Communication

Introduction: How 880 Cores Talk

You have 880 cores. They need to share data. How do they communicate?

This module answers that question.

What You'll Learn

Key Insight: The network is the bottleneck. Optimize communication, not computation.


Part 1: CS Theory - Network Fundamentals


The Communication Problem

Scenario: Core 0 needs data from Core 100's memory.

Options:

Option 1: Shared Memory (Like Your Laptop's CPU)

Core 0 → [L1 Cache] → [L2 Cache] → [L3 Cache] → Memory
                                                    ↓
Core 100 → [L1 Cache] → [L2 Cache] → [L3 Cache] → Memory

Pros: Transparent (looks like local memory)
Cons: Cache coherence overhead; doesn't scale beyond ~64 cores

Option 2: Message Passing (Like Tenstorrent)

Core 0 → [Send message via NoC] → Core 100

Pros: Scales to 1000+ cores; explicit control over data movement
Cons: Programmer must manage communication

Tenstorrent uses Option 2.

Network Topologies

How do you connect N nodes? Many options:

1. Bus (Shared Medium)

Pros: Simple, cheap
Cons: Only one message at a time (serialization); doesn't scale
Example: Older computer systems, the original PCI bus

2. Crossbar (Full Connectivity)

Pros: Any-to-any communication, no contention
Cons: O(N²) wires; doesn't scale beyond ~100 nodes
Example: Small routers, interconnects

3. Mesh (Tenstorrent's Choice)

Pros: Scales well (O(N) wires); regular structure; maps naturally onto a 2D chip
Cons: Multi-hop latency (corner to corner takes many hops)
Example: Intel Xeon Phi, Google TPU, Tenstorrent

graph LR
    subgraph "Bus (Poor Scaling)"
    B0["Core 0"] --- BUS["Shared Bus"]
    B1["Core 1"] --- BUS
    B2["Core 2"] --- BUS
    B3["Core 3"] --- BUS
    end

    subgraph "Crossbar (Doesn't Scale)"
    X0["Core 0"] -.->|Direct| X1["Core 1"]
    X0 -.->|Direct| X2["Core 2"]
    X0 -.->|Direct| X3["Core 3"]
    X1 -.->|Direct| X2
    X1 -.->|Direct| X3
    X2 -.->|Direct| X3
    end

    subgraph "Mesh (Scalable)"
    M0["Core 0"] --- M1["Core 1"]
    M1 --- M2["Core 2"]
    M2 --- M3["Core 3"]
    M0 --- M4["Core 4"]
    M1 --- M5["Core 5"]
    M2 --- M6["Core 6"]
    M3 --- M7["Core 7"]
    M4 --- M5
    M5 --- M6
    M6 --- M7
    end

    style BUS fill:#ff6b6b,stroke:#fff,color:#fff
    style M0 fill:#3293b2,stroke:#fff,color:#fff
    style M7 fill:#3293b2,stroke:#fff,color:#fff

Why mesh wins: Balance of scalability, latency, and wire count.
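
To make the tradeoff concrete, here is a small back-of-the-envelope sketch (ordinary host C++, not device code; the formulas are the standard O(N) vs O(N²) estimates, not measured hardware numbers):

#include <cmath>
#include <cstdio>

// Rough wiring/latency estimates for an N-core square mesh vs. a crossbar.
// Mesh: each core links to up to 4 neighbors -> O(N) links, diameter ~2*sqrt(N).
// Crossbar: every pair gets a dedicated link -> O(N^2) links, diameter 1.
int main() {
    for (int n : {16, 64, 256, 1024}) {
        int side = (int)std::sqrt((double)n);
        int mesh_links = 2 * side * (side - 1);      // horizontal + vertical links
        int mesh_diameter = 2 * (side - 1);          // corner-to-corner hops
        long crossbar_links = (long)n * (n - 1) / 2; // one link per pair of cores
        std::printf("N=%4d  mesh: %5d links, %2d hops max   crossbar: %8ld links\n",
                    n, mesh_links, mesh_diameter, crossbar_links);
    }
}

The crossbar's link count explodes quadratically while the mesh pays only a slowly growing hop count, which is the balance the text describes.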

Routing: Finding the Path

Problem: Core (0, 0) wants to send to Core (3, 2). What path?

Routing Algorithm: XY Routing (Tenstorrent's approach)

Rule: Go X direction first, then Y direction

From (0,0) to (3,2):
1. Move right (0,0) → (1,0) → (2,0) → (3,0)  (X direction)
2. Move up   (3,0) → (3,1) → (3,2)           (Y direction)

Total hops: 3 (X) + 2 (Y) = 5 hops
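
A minimal sketch of the same rule in code (plain C++, coordinates and step order exactly as in the example above):

#include <cstdio>
#include <cstdlib>

// Sketch of XY (dimension-ordered) routing: walk X to the destination column,
// then Y to the destination row. Hop count is the Manhattan distance.
void xy_route(int sx, int sy, int dx, int dy) {
    int x = sx, y = sy, hops = 0;
    while (x != dx) { x += (dx > x) ? 1 : -1; ++hops; std::printf("-> (%d,%d) ", x, y); }
    while (y != dy) { y += (dy > y) ? 1 : -1; ++hops; std::printf("-> (%d,%d) ", x, y); }
    std::printf("\ntotal hops: %d (= |dx-sx| + |dy-sy| = %d)\n",
                hops, std::abs(dx - sx) + std::abs(dy - sy));
}

int main() { xy_route(0, 0, 3, 2); }  // reproduces the 5-hop example above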

Advantages:

Deterministic and simple: the path depends only on source and destination
Deadlock-free on a 2D mesh (no cyclic channel dependencies)
Trivial to implement in hardware: each router makes a purely local decision

Disadvantage:

No adaptivity: traffic cannot route around a congested link, so hotspots can form

Latency vs Bandwidth (Again, But For Networks)

Latency: Time for first bit to arrive

Latency = (# hops) × (per-hop latency) + (wire delay)

Example: 5 hops × 1 cycle + 0 cycles = 5 cycles

Bandwidth: Rate of data transfer

Bandwidth = (link width) × (frequency)

Example: 32 bytes/cycle per link
         Hops add latency, not bandwidth loss: once the message is pipelined
         across hops, a new chunk arrives every cycle and bandwidth is sustained

Small messages: latency-bound (overhead dominates)
Large messages: bandwidth-bound (transfer time dominates)
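
The two regimes fall out of a simple first-order model. The sketch below uses illustrative constants (a ~4-cycle base latency, ~1 cycle per hop, ~32 bytes/cycle per link, roughly matching the measurements in Part 4), not datasheet values:

#include <cstdio>

// Illustrative first-order model of one NoC transfer:
//   cycles ≈ base_latency + hops + bytes / link_bytes_per_cycle
double transfer_cycles(double bytes, int hops,
                       double base_latency = 4.0,
                       double link_bytes_per_cycle = 32.0) {
    return base_latency + hops + bytes / link_bytes_per_cycle;
}

int main() {
    for (double bytes : {4.0, 64.0, 1024.0, 16384.0}) {
        double c = transfer_cycles(bytes, 5);  // fixed 5-hop distance
        std::printf("%7.0f B over 5 hops: ~%6.0f cycles, %5.1f bytes/cycle\n",
                    bytes, c, bytes / c);
    }
}

Small transfers are dominated by the fixed term (latency-bound); large transfers approach the link rate (bandwidth-bound).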


Part 2: Industry Context - Networks Everywhere

Data Center Networks

Google's data center:

10,000+ servers connected via network
Problem: Any server needs to talk to any other server
Solution: Clos network (multi-level mesh)

Same principles as NoC: topology choice, routing algorithms, and congestion avoidance, just at data-center scale instead of on-chip.

Tenstorrent's NoC is a data center network on a chip.

GPU Interconnects

NVIDIA NVLink (GPU-to-GPU):

GPU 0 <--[900 GB/s NVLink]--> GPU 1
         vs
GPU 0 <--[25 GB/s PCIe]--> CPU <--[25 GB/s]--> GPU 1

36x faster with direct links!

Principle: Direct point-to-point is faster than going through a hub.

Tenstorrent's NoC: Direct core-to-core communication (no CPU hub)

Ethernet / Internet

Internet routing:

Your laptop → Router → ISP → Backbone → Destination

Routing: BGP (Border Gateway Protocol)
Topology: Hierarchical (star + mesh)
Latency: ~50-100ms (speed of light + routing delays)

NoC routing:

Core (0,0) → 5 hops → Core (3,2)

Routing: XY (deterministic)
Topology: 2D mesh
Latency: ~5 cycles (~5 nanoseconds!)

10 million times faster because:

Distances are millimeters of on-chip wire, not thousands of kilometers of fiber
Routing is a fixed-function hardware decision, not a software protocol stack
The path is deterministic (XY), so there is no route discovery or negotiation


Part 3: On Tenstorrent Hardware - The NoC

Wormhole NoC Architecture

graph TD
    subgraph "Wormhole Chip (12×12 Grid)"
    T00["Tensix(0,0)"] --- T01["Tensix(1,0)"]
    T01 --- T02["Tensix(2,0)"]
    T00 --- T10["Tensix(0,1)"]
    T01 --- T11["Tensix(1,1)"]
    T02 --- T12["Tensix(2,1)"]
    T10 --- T20["Tensix(0,2)"]
    T11 --- T21["Tensix(1,2)"]
    T12 --- T22["Tensix(2,2)"]

    T00 --- DRAM0["DRAMBank 0"]
    T02 --- PCIE["PCIe"]
    T22 --- DRAM1["DRAMBank 1"]
    end

    style T00 fill:#3293b2,stroke:#fff,color:#fff
    style T11 fill:#3293b2,stroke:#fff,color:#fff
    style T22 fill:#3293b2,stroke:#fff,color:#fff
    style DRAM0 fill:#5347a4,stroke:#fff,color:#fff
    style DRAM1 fill:#5347a4,stroke:#fff,color:#fff
    style PCIE fill:#499c8d,stroke:#fff,color:#fff

Key components:

Tensix cores, each with its own NoC router
DRAM banks attached at the edges of the grid
A PCIe node for communication with the host
Links between adjacent nodes forming the 2D mesh

Each Tensix has a router:

        North ↑
             │
West ← [Router] → East
             │
        South ↓
       +
     Local (to Tensix core)

5-port router: North, South, East, West, Local
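
Conceptually, each router answers one question per packet: which port next? Below is a sketch of that decision assuming the XY rule and the "increasing Y is up" convention used earlier; it is an illustrative software model only, since the real router is fixed-function hardware:

#include <cassert>

enum class Port { North, South, East, West, Local };

// Per-hop decision of an XY router: correct X first, then Y, then deliver locally.
Port select_output_port(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (dst_x > cur_x) return Port::East;
    if (dst_x < cur_x) return Port::West;
    if (dst_y > cur_y) return Port::North;   // increasing Y treated as "up", as in Part 1
    if (dst_y < cur_y) return Port::South;
    return Port::Local;                      // arrived: hand the data to the core
}

int main() {
    assert(select_output_port(0, 0, 3, 2) == Port::East);   // still correcting X
    assert(select_output_port(3, 0, 3, 2) == Port::North);  // X done, now Y
    assert(select_output_port(3, 2, 3, 2) == Port::Local);  // delivered
}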

NoC Address Format

64-bit NoC address:

┌───────────┬──────────┬──────────────────────┐
│ NoC Y (16)│ NoC X (16)│ Local Address (32)   │
└───────────┴──────────┴──────────────────────┘
  Bits 48-63  Bits 32-47  Bits 0-31

Example: Access L1 SRAM at offset 0x1000 on core (3, 2)
NoC Address = (2 << 48) | (3 << 32) | 0x1000
            = 0x0002_0003_0000_1000

Building a NoC address in code:

uint64_t get_noc_addr(uint32_t x, uint32_t y, uint32_t local_addr) {
    return ((uint64_t)y << 48) | ((uint64_t)x << 32) | local_addr;
}

// Example: Core (5, 3), L1 offset 0x2000
uint64_t addr = get_noc_addr(5, 3, 0x2000);

NoC DMA Operations

Three types of transfers:

1. Point-to-Point (Read)

// Read 1 KB from core (3,2) to my L1
uint64_t remote_addr = get_noc_addr(3, 2, 0x1000);
noc_async_read(remote_addr, my_l1_addr, 1024);
noc_async_read_barrier();

2. Point-to-Point (Write)

// Write 1 KB from my L1 to core (5,7)
uint64_t remote_addr = get_noc_addr(5, 7, 0x2000);
noc_async_write(my_l1_addr, remote_addr, 1024);
noc_async_write_barrier();

3. Multicast (One-to-Many)

// Send same data to multiple cores
CoreRange dest = CoreRange{{0,0}, {3,3}};  // 4×4 = 16 cores
noc_async_write_multicast(my_l1_addr, dest, remote_addr, 1024);
noc_async_write_barrier();

Multicast optimization: the source injects the data once and the routers replicate it along the way, so reaching a 4×4 rectangle of destinations costs roughly one transfer instead of 16 (measured in Experiment 3 below).


Part 4: Hands-On - Measuring NoC Performance

Experiment 1: Latency vs Distance

Question: Does distance matter?

Test: Send 4 bytes to cores at different distances

// Kernel: measure_noc_latency.cpp
void kernel_main() {
    uint32_t my_x = get_core_x();
    uint32_t my_y = get_core_y();

    // Test different destinations
    uint32_t test_cases[5][2] = {
        {my_x+1, my_y},      // 1 hop (adjacent)
        {my_x+2, my_y},      // 2 hops
        {my_x+5, my_y},      // 5 hops
        {my_x+10, my_y},     // 10 hops
        {my_x+10, my_y+10}   // 20 hops (Manhattan distance)
    };

    for (int i = 0; i < 5; i++) {
        uint64_t remote_addr = get_noc_addr(test_cases[i][0], test_cases[i][1], 0x1000);

        uint64_t start = get_cycle_count();
        noc_async_read(remote_addr, my_l1_addr, 4);  // Read 4 bytes
        noc_async_read_barrier();
        uint64_t cycles = get_cycle_count() - start;

        DPRINT << "Distance " << test_cases[i][0] - my_x + test_cases[i][1] - my_y
               << " hops: " << cycles << " cycles\n";
    }
}

Expected results:

Distance 1 hop:  ~5 cycles
Distance 2 hops: ~6 cycles
Distance 5 hops: ~9 cycles
Distance 10 hops: ~14 cycles
Distance 20 hops: ~24 cycles

Observation: ~1 cycle per hop + ~4 cycle base latency

Takeaway: Distance matters! Keep communicating cores close together.

Experiment 2: Bandwidth vs Message Size

Question: How does transfer size affect bandwidth?

Test: Transfer different message sizes

// Kernel: measure_noc_bandwidth.cpp
void kernel_main() {
    uint32_t sizes[] = {4, 64, 256, 1024, 4096, 16384};  // Bytes

    for (int i = 0; i < 6; i++) {
        uint64_t remote_addr = get_noc_addr(5, 5, 0x1000);  // Fixed distance

        uint64_t start = get_cycle_count();
        noc_async_read(remote_addr, my_l1_addr, sizes[i]);
        noc_async_read_barrier();
        uint64_t cycles = get_cycle_count() - start;

        float bytes_per_cycle = sizes[i] / (float)cycles;  // at ~1 GHz, 1 byte/cycle ≈ 1 GB/s
        DPRINT << "Size " << sizes[i] << " bytes: "
               << cycles << " cycles, "
               << bytes_per_cycle << " bytes/cycle\n";
    }
}

Expected results:

Size 4 bytes:     ~10 cycles  → 0.4 bytes/cycle  (latency-bound)
Size 64 bytes:    ~12 cycles  → 5.3 bytes/cycle
Size 256 bytes:   ~20 cycles  → 12.8 bytes/cycle
Size 1024 bytes:  ~48 cycles  → 21.3 bytes/cycle
Size 4096 bytes:  ~160 cycles → 25.6 bytes/cycle (bandwidth-saturated)
Size 16384 bytes: ~600 cycles → 27.3 bytes/cycle

Graph: Bandwidth vs Message Size

xychart-beta
    title "NoC Bandwidth vs Message Size"
    x-axis ["4 B", "64 B", "256 B", "1 KB", "4 KB", "16 KB"]
    y-axis "Bandwidth (bytes/cycle)" 0 --> 32
    line [0.4, 5.3, 12.8, 21.3, 25.6, 27.3]

Takeaway: Small messages waste bandwidth (latency overhead). Large messages amortize latency.

Optimization: Batch small transfers into one large transfer.
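
A sketch of what batching looks like in a kernel, reusing the calls from the experiments above; the remote core, offsets, and contiguous layout are assumptions for illustration:

// Slow: 16 separate 64 B reads, each paying the ~4-cycle base latency
for (int i = 0; i < 16; i++) {
    uint64_t src = get_noc_addr(5, 5, 0x1000 + i * 64);
    noc_async_read(src, my_l1_addr + i * 64, 64);
}
noc_async_read_barrier();

// Fast: one batched 1 KB read of the same data, paying the base latency once
noc_async_read(get_noc_addr(5, 5, 0x1000), my_l1_addr, 16 * 64);
noc_async_read_barrier();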

Experiment 3: Multicast vs Unicast

Question: Is multicast really faster?

Test: Send to 16 cores via multicast vs 16 unicast

// Kernel: compare_multicast.cpp
void kernel_main() {
    CoreRange dest = CoreRange{{0,0}, {3,3}};  // 4×4 grid = 16 cores

    // Method 1: Unicast (sequential sends)
    uint64_t start = get_cycle_count();
    for (uint32_t y = 0; y < 4; y++) {
        for (uint32_t x = 0; x < 4; x++) {
            uint64_t remote_addr = get_noc_addr(x, y, 0x1000);
            noc_async_write(my_l1_addr, remote_addr, 1024);
        }
    }
    noc_async_write_barrier();
    uint64_t unicast_cycles = get_cycle_count() - start;

    // Method 2: Multicast (single send, hardware replicates)
    start = get_cycle_count();
    noc_async_write_multicast(my_l1_addr, dest, 0x1000, 1024);
    noc_async_write_barrier();
    uint64_t multicast_cycles = get_cycle_count() - start;

    DPRINT << "Unicast to 16 cores: " << unicast_cycles << " cycles\n";
    DPRINT << "Multicast to 16 cores: " << multicast_cycles << " cycles\n";
    DPRINT << "Speedup: " << (float)unicast_cycles / multicast_cycles << "x\n";
}

Expected results:

Unicast to 16 cores: ~800 cycles  (16 × 50 cycles per send)
Multicast to 16 cores: ~60 cycles (one send, hardware replicates)
Speedup: 13.3x

Takeaway: Multicast is critical for broadcast patterns (all-reduce, synchronization).


Part 5: Communication Patterns and Optimization

Pattern 1: All-to-All (Worst Case)

Problem: Every core sends to every other core

// Each core sends its data to ALL other cores
for (uint32_t dest_y = 0; dest_y < 12; dest_y++) {
    for (uint32_t dest_x = 0; dest_x < 12; dest_x++) {
        if (dest_x != my_x || dest_y != my_y) {
            uint64_t remote_addr = get_noc_addr(dest_x, dest_y, 0x1000);
            noc_async_write(my_l1_addr, remote_addr, 1024);
        }
    }
}

Cost: 176 cores × 175 destinations × 50 cycles = 1.5 million cycles

Problem: Massive congestion (many cores sending simultaneously)

Solution: Phase the communication (different cores send at different times)
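
One simple phasing scheme staggers destinations by the sender's own index, so that in any phase no two cores target the same core. A sketch, with grid_w, grid_h, my_x, my_y, and my_l1_addr assumed to be available as in the earlier kernels:

// Phase p: every core sends to (my_id + p) % P, so destinations never collide.
uint32_t P = grid_w * grid_h;               // total cores, e.g. 12 * 12
uint32_t my_id = my_y * grid_w + my_x;

for (uint32_t p = 1; p < P; p++) {          // skip p == 0 (self)
    uint32_t dest_id = (my_id + p) % P;
    uint32_t dest_x = dest_id % grid_w;
    uint32_t dest_y = dest_id / grid_w;
    noc_async_write(my_l1_addr, get_noc_addr(dest_x, dest_y, 0x1000), 1024);
    // Optionally barrier every few phases to keep router queues shallow.
}
noc_async_write_barrier();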

Pattern 2: Reduction (Sum Across All Cores)

Problem: Compute sum of values across all cores

Naive approach: Send everything to Core 0 (sequential bottleneck)

// All cores send to core (0,0)
if (my_x != 0 || my_y != 0) {
    noc_async_write(my_value, get_noc_addr(0, 0, offset), 4);
}
// Core (0,0) sums 176 values serially

Cost: 176 × 50 cycles = 8,800 cycles

Optimized approach: Tree-based reduction (parallel)

// Phase 1: Pairs of cores reduce (88 parallel operations)
if (my_id % 2 == 0) {
    receive_from(my_id + 1);
    my_value += received_value;
}

// Phase 2: Reduce again (44 parallel operations)
if (my_id % 4 == 0) {
    receive_from(my_id + 2);
    my_value += received_value;
}

// ... Continue: ceil(log2(176)) = 8 phases total ...

// Final result in core 0

Cost: 8 phases × 50 cycles = 400 cycles (22x faster!)

This is what MPI_Reduce does internally.
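
The phased pattern generalizes to a single loop. This is only a sketch: send_to and receive_from are hypothetical helpers standing in for a NoC write plus a ready flag that the receiver polls (building that handshake safely is Module 5's topic), and num_cores, my_id, my_value are assumed variables.

uint32_t stride = 1;
while (stride < num_cores) {
    if (my_id % (2 * stride) == 0) {
        if (my_id + stride < num_cores) {
            my_value += receive_from(my_id + stride);  // add partner's partial sum
        }
    } else if (my_id % (2 * stride) == stride) {
        send_to(my_id - stride, my_value);             // ship my partial sum upward
        break;                                         // this core is done
    }
    stride *= 2;
}
// After ceil(log2(num_cores)) phases, core 0 holds the total.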

Pattern 3: Nearest-Neighbor (Best Case)

Problem: Each core needs data from adjacent cores only

// Read from 4 neighbors (N, S, E, W)
uint64_t north = get_noc_addr(my_x, my_y+1, 0x1000);
uint64_t south = get_noc_addr(my_x, my_y-1, 0x1000);
uint64_t east  = get_noc_addr(my_x+1, my_y, 0x1000);
uint64_t west  = get_noc_addr(my_x-1, my_y, 0x1000);

// Read 1 KB from each neighbor (4 × 1-hop transfers)
noc_async_read(north, my_l1+0, 1024);
noc_async_read(south, my_l1+1024, 1024);
noc_async_read(east,  my_l1+2048, 1024);
noc_async_read(west,  my_l1+3072, 1024);
noc_async_read_barrier();

Cost: 4 × 50 cycles = 200 cycles (all cores in parallel, no congestion)

This is the best communication pattern (local, parallel, low latency).

Applications: Stencil computations, cellular automata, diffusion equations
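
As a sketch of the compute step once the four neighbor blocks have landed in L1, here is a single diffusion-style update. The float layout and the extra offsets for this core's own values and output (0x1000 and 0x1400 past my_l1) are assumptions for illustration:

// One Jacobi/diffusion step over the 1 KB blocks read from the four neighbors.
constexpr uint32_t N = 1024 / sizeof(float);   // 256 floats per neighbor block
constexpr float alpha = 0.1f;                  // diffusion coefficient (illustrative)

float* north = (float*)(my_l1 + 0);
float* south = (float*)(my_l1 + 1024);
float* east  = (float*)(my_l1 + 2048);
float* west  = (float*)(my_l1 + 3072);
float* mine  = (float*)(my_l1 + 4096);         // this core's own values (assumed offset)
float* out   = (float*)(my_l1 + 5120);         // result buffer (assumed offset)

for (uint32_t i = 0; i < N; i++) {
    out[i] = mine[i] + alpha * (north[i] + south[i] + east[i] + west[i] - 4.0f * mine[i]);
}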


Part 6: Discussion Questions

Question 1: Why Not a Crossbar?

Q: Crossbar has lower latency (direct connections). Why use a mesh?

A: Cost and scalability.

A crossbar needs a dedicated link between every pair of nodes, which is O(N²) wiring: for 176 cores that is roughly 15,000 point-to-point links, infeasible in silicon, while a mesh needs only about 4 links per core. For 1000+ cores (future), mesh still scales.

Question 2: What About Congestion?

Q: What if several cores all send to the same destination?

A: The network serializes them (one at a time).

Cores 0,1,2,3,4 all send to Core 50 simultaneously:
- Core 0's message arrives first (5 hops × 1 cycle = 5 cycles)
- Cores 1-4 must wait (queued at routers)
- Total time: 5 + 4×50 = 205 cycles (vs 50 cycles with no congestion)

Optimization: Distribute destinations (avoid hotspots)

In practice: This is a real problem (e.g., all-reduce has hotspot at root). Tree-based algorithms mitigate this.

Question 3: How Does This Compare to Ethernet?

Ethernet (1 Gbps):

Bandwidth: 125 MB/s = 0.125 bytes per nanosecond
Latency: ~0.1 ms = 100,000 nanoseconds

Tenstorrent NoC (1 GHz clock):

Bandwidth: 32 bytes/cycle = 32 bytes per nanosecond (256x faster!)
Latency: 5-20 cycles = 5-20 nanoseconds (5000x faster!)

Why the massive difference?

NoC is to Ethernet what L1 is to DRAM (different scales, same principles).


Part 7: Real-World Example - Flash Attention

Flash Attention is a breakthrough algorithm that avoids materializing the O(N²) attention matrix, cutting memory use and data movement to O(N) through careful communication management (the arithmetic is still O(N²); the traffic is not).

Standard Attention (Communication-Heavy)

# Standard attention (naive)
Q = query_matrix   # (seq_len, d)
K = key_matrix     # (seq_len, d)
V = value_matrix   # (seq_len, d)

# Compute attention scores (all-to-all communication!)
scores = Q @ K.T   # (seq_len, seq_len) - HUGE MATRIX
attn = softmax(scores)
output = attn @ V

Problem: scores matrix is O(N²) in memory and communication

For seq_len = 16K:

The scores matrix has 16,384 × 16,384 ≈ 268 million entries
At 2 bytes per entry that is ~512 MB, far too large for any core's L1, so it must stream through DRAM

Flash Attention (Communication-Optimized)

# Flash Attention (optimized)
# Break Q, K, V into blocks that fit in L1
for q_block in Q_blocks:
    for k_block in K_blocks:
        # Load blocks into L1 (small transfer)
        scores_block = q_block @ k_block.T  # Compute on L1
        attn_block = softmax(scores_block)  # Stays in L1
        output_block += attn_block @ v_block
        # No need to store the full scores matrix!
        # (Real Flash Attention also keeps a running max and denominator so the
        #  softmax stays correct across k_blocks: the "online softmax" trick.)

Optimization:

Q, K and V blocks are sized to fit in fast on-chip memory
The full O(N²) scores matrix is never materialized or written to DRAM
What shrinks is data movement, not arithmetic, and data movement is the bottleneck

This is why Flash Attention is so important - it's a network optimization, not an algorithm optimization.


Part 8: Connections to Other Systems

CPUs: Implicit Communication via Cache

x86 Core-to-Core Communication:

Core 0 writes X
Core 1 reads X

Hardware automatically:
1. Invalidates Core 1's cache line
2. Transfers data via cache coherence protocol (MESI)
3. Updates Core 1's cache

Tenstorrent Core-to-Core:

Core (0,0) writes X
Core (1,0) reads X

Programmer explicitly:
1. Core (0,0): noc_async_write(x, get_noc_addr(1,0,addr), 4)
2. Core (1,0): barrier, then read from L1

Tradeoff: Explicit is more code, but scales to 880 cores.
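
A sketch of what the explicit version looks like with the calls from Part 3; the destination offset is illustrative, and the ready signal the receiver needs is deliberately left for Module 5:

// On core (0,0): push the value into core (1,0)'s L1.
*(volatile uint32_t*)my_l1_addr = 5;              // stage the value in my own L1
uint64_t dest = get_noc_addr(1, 0, 0x3000);       // core (1,0), L1 offset 0x3000 (assumed)
noc_async_write(my_l1_addr, dest, 4);
noc_async_write_barrier();                        // my write has left this core

// On core (1,0), once signalled that the data has landed (Module 5's topic):
uint32_t received = *(volatile uint32_t*)0x3000;  // plain local L1 read, no NoC needed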

GPUs: Similar NoC Architecture

NVIDIA A100:

108 streaming multiprocessors (SMs), each with its own shared memory / L1
SMs share data through an on-chip interconnect to a common L2 and HBM
Fast sharing within a block is explicit (shared memory plus barriers), much like Tenstorrent's L1

Communication patterns:

// GPU code (similar to Tenstorrent!)
__shared__ float shared_data[256];  // L1-like shared memory

// Each thread loads its data
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
shared_data[threadIdx.x] = global_data[globalIdx];
__syncthreads();  // Barrier (wait for all threads)

// Now all threads can access shared_data

Same principles: explicit communication, barriers, local memory.

Data Centers: Network Topology

Google data center:

Clos Network (multi-level mesh)
         [Spine switches]
               ↓
         [Leaf switches]
               ↓
          [Servers]

Tenstorrent chip:

2D Mesh Network
    [DRAM] ← [Tensix cores] → [DRAM]
                  ↓
              [PCIe]

Same topology principles at different scales.


Part 9: Key Takeaways

After this module, you should understand:

Why message passing over a mesh NoC scales where cache coherence does not
How XY routing picks a deterministic path, and what that costs in hops
How NoC addresses are built and how async read / write / multicast transfers work
How latency and bandwidth change with distance and message size
Which communication patterns scale (nearest-neighbor, tree reduction) and which do not (naive all-to-all)

The Core Insight

The network determines scalability.

Tenstorrent's mesh NoC:

O(N) wiring that scales to hundreds of cores per chip
About 1 cycle per hop plus a few cycles of base latency
Explicit DMA (reads, writes, multicast) under programmer control

Communication patterns matter as much as algorithms.


Part 10: Preview of Module 5 - Synchronization

We've seen how cores communicate. But what if two cores access the same memory simultaneously?

Teaser questions:

  1. Race condition: Core 0 writes X=5, Core 1 writes X=10. What's the final value?
  2. Deadlock: Core 0 waits for Core 1, Core 1 waits for Core 0. What happens?
  3. Barriers: How do you ensure all cores finish Phase 1 before starting Phase 2?

Module 5 teaches synchronization primitives and how to avoid concurrency bugs.


Additional Resources

Network Theory

NoC Design

Tenstorrent Resources


Summary

We explored:

Network topologies and why a 2D mesh wins at this scale
XY routing, the NoC address format, and async DMA operations
Measured latency, bandwidth, and multicast behavior
Communication patterns: all-to-all, reduction trees, nearest-neighbor

Key lesson: Optimize communication patterns, not just algorithms.

Next: We explore synchronization and coordination across 880 cores.

→ Continue to Module 5: Synchronization