n150 n300 T3000 p100 p150 p300c Galaxy Sim 60 min Draft

Twenty-and-Ten-Plus-One Things You Can Do with ttsim

ttsim is a hardware-accurate functional simulator for Tenstorrent Wormhole and Blackhole chips. It ships as a single .so file that plugs into TT-Metalium via an environment variable. Every kernel that compiles for silicon compiles for the simulator. Results are bit-exact. It runs on any Linux/x86_64 machine, including WSL2 on Windows.

This lesson is self-contained. Setup is below. No Tenstorrent hardware required.

Have hardware? The simulator is still useful for debugging, architecture exploration, and running experiments without tying up a device.

ttsim highlight reel — 6 of 31 entries running against ttsim v1.8.0


Setup

⚙ Set Up ttsim
mkdir -p ~/sim
wget -q https://github.com/tenstorrent/ttsim/releases/download/v1.8.0/libttsim_wh.so -O ~/sim/libttsim_wh.so || { echo "ERROR: failed to download libttsim_wh.so"; exit 1; }
wget -q https://github.com/tenstorrent/ttsim/releases/download/v1.8.0/libttsim_bh.so -O ~/sim/libttsim_bh.so || { echo "ERROR: failed to download libttsim_bh.so"; exit 1; }
wget -q https://github.com/tenstorrent/ttsim/releases/download/v1.8.0/libttsim_wh_x2.so -O ~/sim/libttsim_wh_x2.so || { echo "ERROR: failed to download libttsim_wh_x2.so"; exit 1; }
if [ -n "$TT_METAL_HOME" ]; then
  cp $TT_METAL_HOME/tt_metal/soc_descriptors/wormhole_b0_80_arch.yaml ~/sim/soc_descriptor.yaml || { echo "ERROR: failed to copy SOC descriptor"; exit 1; }
  cp $TT_METAL_HOME/tests/tt_metal/tt_fabric/custom_mock_cluster_descriptors/n300_cluster_desc.yaml ~/sim/n300_cluster_desc.yaml || { echo "WARNING: n300 cluster desc copy skipped (optional for N300 sim)"; }
else
  echo "TT_METAL_HOME not set — SOC descriptor copy skipped"
fi
echo "ttsim v1.8.0 ready (wh + bh + wh_x2 for N300 multichip)"

Or manually:

mkdir -p ~/sim
TTSIM_VERSION=v1.8.0

# Download Wormhole, Blackhole, and N300 (2-chip Wormhole mesh) simulators
wget https://github.com/tenstorrent/ttsim/releases/download/${TTSIM_VERSION}/libttsim_wh.so \
     -O ~/sim/libttsim_wh.so
wget https://github.com/tenstorrent/ttsim/releases/download/${TTSIM_VERSION}/libttsim_bh.so \
     -O ~/sim/libttsim_bh.so
wget https://github.com/tenstorrent/ttsim/releases/download/${TTSIM_VERSION}/libttsim_wh_x2.so \
     -O ~/sim/libttsim_wh_x2.so

# Copy the SOC descriptor for Wormhole (switch for Blackhole in entries 3 and 27)
cp $TT_METAL_HOME/tt_metal/soc_descriptors/wormhole_b0_80_arch.yaml ~/sim/soc_descriptor.yaml

# Copy the N300 cluster descriptor (used for multichip simulation — entry 31)
cp $TT_METAL_HOME/tests/tt_metal/tt_fabric/custom_mock_cluster_descriptors/n300_cluster_desc.yaml \
   ~/sim/n300_cluster_desc.yaml

# Required env vars — set these before running any entry below
export TT_METAL_SIMULATOR=~/sim/libttsim_wh.so
export TT_METAL_SLOW_DISPATCH_MODE=1
export TT_METAL_DISABLE_SFPLOADMACRO=1

Prerequisite: tt-metal must be installed and built. If you haven't done that yet, start with the build tt-metal lesson.

All examples below run from $TT_METAL_HOME unless noted.
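A quick preflight can catch a missing variable before the first entry fails. The sketch below is plain Python and checks only the three variables and the simulator file named in the setup above. The function name `missing_ttsim_settings` is hypothetical and is not part of tt-metal or ttsim.

```python
import os

# The three variables the setup section asks for.
REQUIRED_VARS = (
    "TT_METAL_SIMULATOR",
    "TT_METAL_SLOW_DISPATCH_MODE",
    "TT_METAL_DISABLE_SFPLOADMACRO",
)


def missing_ttsim_settings(env, file_exists=os.path.isfile):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    sim_path = env.get("TT_METAL_SIMULATOR")
    if sim_path and not file_exists(os.path.expanduser(sim_path)):
        problems.append(f"simulator library not found: {sim_path}")
    return problems


if __name__ == "__main__":
    for problem in missing_ttsim_settings(os.environ):
        print("PROBLEM:", problem)
```

Run it in the same shell session that will launch the examples, since the exported variables only exist there.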


The Twenty

1. Run Tenstorrent on Windows

WSL2 + libttsim_wh.so. Set the three env vars above inside a WSL2 session and the single-chip entries in this lesson run as written. No hardware. No special drivers. No silicon anywhere in the chain.

# In a WSL2 terminal on Windows:
export TT_METAL_SIMULATOR=~/sim/libttsim_wh.so
export TT_METAL_SLOW_DISPATCH_MODE=1
export TT_METAL_DISABLE_SFPLOADMACRO=1
# Then run any entry in this lesson

2. Hello, RISC-V

add_2_integers_in_riscv dispatches a kernel onto the BRISC (data-movement RISC-V core) of a virtual Tensix. Two integers added together. Real RISC-V ISA. Real dispatch path.

cd $TT_METAL_HOME
./build/programming_examples/metal_example_add_2_integers_in_riscv
Success: Result is 21

3. Own both chips for free

Download both .so files (the setup above does this). Switch architectures by changing one environment variable and replacing the SOC descriptor.

# Switch to Blackhole (140-core SOC)
cp $TT_METAL_HOME/tt_metal/soc_descriptors/blackhole_140_arch.yaml ~/sim/soc_descriptor.yaml
export TT_METAL_SIMULATOR=~/sim/libttsim_bh.so

./build/programming_examples/metal_example_add_2_integers_in_riscv

# Switch back to Wormhole
cp $TT_METAL_HOME/tt_metal/soc_descriptors/wormhole_b0_80_arch.yaml ~/sim/soc_descriptor.yaml
export TT_METAL_SIMULATOR=~/sim/libttsim_wh.so
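The switch above always moves two things together: the simulator library and the SOC descriptor. A minimal sketch of that pairing as a lookup table, assuming the file names used in this lesson; `ARCH_FILES` and `arch_plan` are hypothetical helpers, not tt-metal APIs.

```python
import os

# Pairs taken from the shell commands above: simulator library + SOC descriptor.
ARCH_FILES = {
    "wormhole": ("libttsim_wh.so", "wormhole_b0_80_arch.yaml"),
    "blackhole": ("libttsim_bh.so", "blackhole_140_arch.yaml"),
}


def arch_plan(arch, tt_metal_home, sim_dir):
    """Return (descriptor_source, descriptor_target, simulator_path) for an arch."""
    library, descriptor = ARCH_FILES[arch]
    source = os.path.join(tt_metal_home, "tt_metal", "soc_descriptors", descriptor)
    target = os.path.join(sim_dir, "soc_descriptor.yaml")
    return source, target, os.path.join(sim_dir, library)
```

To apply a plan, copy the source descriptor onto the target (for example with shutil.copyfile) and point TT_METAL_SIMULATOR at the returned library path before launching an example.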

4. Talk to the compute engine

The compute RISC-V (TRISC) is a separate processor from the data-movement RISC-V. hello_world_compute_kernel dispatches a kernel specifically to the TRISC.

./build/programming_examples/metal_example_hello_world_compute_kernel
Hello, Core (0, 0) on Device 0, I am sending you a compute kernel. Standby awaiting communication.
Thank you, Core {0, 0} on Device 0, for the completed task.

5. Elementary school math on an AI accelerator

2 + 3 = 5, dispatched through a chip designed to run large language models. The full dispatch path — host program, command queue, kernel compilation, BRISC/TRISC execution — for a trivial operation.

./build/programming_examples/metal_example_add_2_integers_in_compute
Success: Result matches expected value!

6. Invoke the Special Function Processing Unit

The SFPU is a vector unit inside each Tensix core that performs transcendental functions as native hardware operations — exp, log, sqrt, gelu. These are silicon opcodes, not library calls.

./build/programming_examples/metal_example_eltwise_sfpu
Test Passed

7. Chain SFPU ops into a pipeline

sfpu_eltwise_chain runs a sequence of SFPU operations on a tile without intermediate results touching DRAM. The values stay in the register file between steps. This is how softmax is computed on Tensix hardware.

./build/programming_examples/metal_example_sfpu_eltwise_chain
Metalium vs Golden -- PCC = 0.99986374
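The PCC figure in the output above is a Pearson correlation coefficient between the device result and a host-computed golden result. A minimal plain-Python sketch of that metric follows; the example binaries compute it with their own helper, which may differ in details such as how constant inputs are handled.

```python
import math


def pcc(golden, actual):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(golden)
    mean_g = sum(golden) / n
    mean_a = sum(actual) / n
    cov = sum((g - mean_g) * (a - mean_a) for g, a in zip(golden, actual))
    var_g = sum((g - mean_g) ** 2 for g in golden)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_g * var_a)
```

A value near 1 means the two results move together almost perfectly. PCC ignores uniform scaling and offset, so it is usually paired with a tolerance check when absolute accuracy matters.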

8. The kernel that runs when you're watching is not the kernel that runs when you're not

TT_METAL_DPRINT_CORES is checked at kernel compilation time — not at runtime. Setting it changes what code gets compiled into the kernel binary. The observation changes the experiment.

# Without DPRINT: standard kernel binary, no instrumentation
./build/programming_examples/metal_example_hello_world_datamovement_kernel

# With DPRINT: a different kernel binary is compiled and dispatched
export TT_METAL_DPRINT_CORES=0,0
export TT_METAL_DPRINT_RISCVS=BR
./build/programming_examples/metal_example_hello_world_datamovement_kernel
unset TT_METAL_DPRINT_CORES TT_METAL_DPRINT_RISCVS

The second invocation prints from inside the running kernel. The first does not — the instrumentation was never compiled in.


9. Operate on 1,024 values simultaneously

A tile is a 32×32 array of bfloat16 values. eltwise_binary adds, subtracts, or multiplies every element in a single dispatched operation.

./build/programming_examples/metal_example_eltwise_binary
Test Passed
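Each of the 1,024 tile elements is a bfloat16: the upper 16 bits of a float32 encoding. The sketch below rounds a Python float to that format using only the standard library. It assumes round-to-nearest-even, which is an assumption of this example rather than a statement about how the hardware rounds, and it leaves NaN and infinity handling out.

```python
import struct


def to_bfloat16_bits(value):
    """Round a Python float to bfloat16 and return the 16-bit pattern.

    Assumes round-to-nearest-even on the float32 encoding; NaN/inf handling
    is deliberately left out of this sketch.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def from_bfloat16_bits(bits16):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


TILE_ELEMENTS = 32 * 32  # 1,024 values per tile
```

Round-tripping 3.14159265 through these two functions gives 3.140625, which shows how little mantissa bfloat16 carries.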

10. Run the matmul that powers everything

Matrix multiplication is the fundamental operation of transformer inference. matmul_single_core runs it on one core, start to finish, in tile layout.

./build/programming_examples/metal_example_matmul_single_core
Output vector of size 409600
Metalium vs Golden -- PCC = 0.982093
Test Passed

11. Light up the grid

matmul_multi_core distributes the same matrix multiplication across multiple cores.

./build/programming_examples/metal_example_matmul_multi_core
Output vector of size 409600
Metalium vs Golden -- PCC = 0.9999391
Test Passed

12. Why SRAM reuse is the whole secret

matmul_multicore_reuse keeps weight tiles in L1 SRAM across multiple output tiles instead of re-fetching from DRAM. This is the optimization that closes the gap between raw FLOP capacity and memory bandwidth on Tensix hardware.

./build/programming_examples/metal_example_matmul_multicore_reuse
Metalium vs Golden -- PCC = 0.99930096
Test Passed
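A back-of-the-envelope count shows why reuse matters. The sketch below is a simplified model, not the blocking scheme the example actually uses: it assumes each core computes one block of output tiles and loads every input tile it needs exactly once per block. The function name and parameters are hypothetical.

```python
def dram_tile_fetches(mt, nt, kt, block_m=1, block_n=1):
    """Count input tiles fetched for an (mt x kt) @ (kt x nt) tiled matmul.

    Simplified model: each core computes one block of block_m x block_n
    output tiles and loads every input tile it needs exactly once per block.
    """
    if mt % block_m or nt % block_n:
        raise ValueError("block sizes must divide the output tile grid")
    blocks = (mt // block_m) * (nt // block_n)
    tiles_per_block = (block_m + block_n) * kt
    return blocks * tiles_per_block
```

For an 8x8x8 tile problem, this model gives 1,024 fetches with no reuse (1x1 blocks) and 256 with 4x4 blocks. The arithmetic is identical in both cases; only the memory traffic changes.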

13. Spread a vector add across every core

vecadd_multi_core gives every core a slice of the input. All cores compute simultaneously.

./build/programming_examples/metal_example_vecadd_multi_core
Kernel execution finished
Partial results: (note we are running under BFP16. It's going to be less accurate)
All results match expected values within tolerance.
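Giving every core a slice comes down to dividing a tile count across a core count and deciding who takes the remainder. The sketch below is a hypothetical plain-Python illustration of that bookkeeping; tt-metal ships its own work-splitting utilities, which this does not reproduce.

```python
def split_tiles(num_tiles, num_cores):
    """Assign contiguous tile ranges to cores, spreading any remainder.

    Returns a list of (start, count) pairs, one per core that gets work.
    """
    base, extra = divmod(num_tiles, num_cores)
    ranges, start = [], 0
    for core in range(num_cores):
        count = base + (1 if core < extra else 0)
        if count:
            ranges.append((start, count))
        start += count
    return ranges
```

With 10 tiles and 4 cores, the first two cores take 3 tiles each and the last two take 2. With fewer tiles than cores, the surplus cores simply receive no range.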

14. Shard data across cores

vecadd_sharding splits a tensor into shards and gives each core its own piece. As the output below shows, a 4x4 grid of tiles is height-sharded across a 4x1 grid of cores, so each core handles a 1x4 strip of tiles.

./build/programming_examples/metal_example_vecadd_sharding
Sharding 4x4 tiles to 4x1 cores in TensorMemoryLayout::HEIGHT_SHARDED mode
Each core will handle 1x4 tiles

Kernel execution finished. Reading results...
Partial results: (note we are running under BFP16. It's going to be less accurate)
All results match expected values within tolerance.
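The "4x4 tiles to 4x1 cores" line in the output describes a height-sharded layout: whole rows of tiles go to each core. A minimal sketch of that mapping, assuming the rows divide evenly as they do in this run; `height_shard` is a hypothetical helper, not a TT-Metalium call.

```python
def height_shard(tile_rows, tile_cols, num_cores):
    """Map each core to the (row, col) tile coordinates it owns.

    Assumes tile_rows divides evenly across cores, as in the 4x4 -> 4x1 run.
    """
    if tile_rows % num_cores:
        raise ValueError("tile rows must divide evenly across cores")
    rows_per_core = tile_rows // num_cores
    return {
        core: [
            (row, col)
            for row in range(core * rows_per_core, (core + 1) * rows_per_core)
            for col in range(tile_cols)
        ]
        for core in range(num_cores)
    }
```

Calling height_shard(4, 4, 4) gives each of the four cores one row of four tiles, matching the "1x4 tiles" line in the output.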

15. Send a tile across the mesh interconnect

noc_tile_transfer moves a tile from core (0,0) to core (0,1) via the on-chip network. No CPU involvement after dispatch. The tile travels the NoC and arrives.

./build/programming_examples/metal_example_noc_tile_transfer
Result = 14 : Expected = 14

16. Write a custom SFPU instruction

custom_sfpi_add is hand-authored SFPI assembly — the instruction set of the SFPU functional unit. This is ISA-level code for a production AI accelerator.

./build/programming_examples/metal_example_custom_sfpi_add
Test Passed

17. Implement smoothstep in SFPU assembly

custom_sfpi_smoothstep implements the smoothstep interpolation function — a standard graphics shader primitive — as SFPU opcodes. The function has no relationship to AI inference. Running it on a Tenstorrent chip is completely unnecessary and entirely possible.

./build/programming_examples/metal_example_custom_smoothstep
Test Passed

18. Dispatch a program to a mesh

1_distributed_program_dispatch uses the mesh device API. The code is structurally identical to single-device dispatch — the API scales, and so does the program.

./build/programming_examples/distributed/distributed_program_dispatch

ttsim note: Requires an 8-device MeshShape [2, 4]. The ttsim builds used in this lesson simulate at most two chips (see entry 31), so this example requires a multi-device system (T3000, Galaxy, TT-QuietBox 2, or equivalent).


19. Read and write distributed buffers

2_distributed_buffer_rw manages memory across a virtual mesh. Every tensor-parallel model does this operation millions of times per inference.

./build/programming_examples/distributed/distributed_buffer_rw

ttsim note: Requires an 8-device mesh. The ttsim builds used in this lesson simulate at most two chips (see entry 31), so run this example on a multi-device system (T3000, Galaxy, TT-QuietBox 2).


20. The primitive of tensor parallelism

3_distributed_eltwise_add performs an element-wise addition across a virtual mesh. Splitting a tensor across devices, computing in parallel, gathering results — this is the building block that lets a model span multiple chips.

./build/programming_examples/distributed/distributed_eltwise_add
Total values: 1024
Distributed elementwise add verification: 1024 / 1024

ttsim note: Requires an 8-device mesh. The ttsim builds used in this lesson simulate at most two chips (see entry 31), so run this example on a multi-device system (T3000, Galaxy, TT-QuietBox 2).


The Ten

21. Trace async execution without a profiler

4_distributed_trace_and_events instruments async barriers and event timelines across a virtual mesh. The shape of the execution trace matches hardware. The timings do not.

./build/programming_examples/distributed/distributed_trace_and_events
Running EltwiseBinary MeshTraces on 2 MeshCQs Passed!

ttsim note: Requires an 8-device mesh. The ttsim builds used in this lesson simulate at most two chips (see entry 31), so run this example on a multi-device system (T3000, Galaxy, TT-QuietBox 2).


22. Trigger intentional UndefinedBehavior and read the named error

Write a kernel that violates an ISA contract. The simulator halts with a named, categorized error. On silicon, the same code would likely produce silently incorrect output.

The simulator is stricter than the hardware on purpose, and the documentation groups its errors into named categories. Two of them appear in this lesson: UndefinedBehavior and UnimplementedFunctionality.

To trigger one: unset TT_METAL_DISABLE_SFPLOADMACRO (which re-enables the unsupported SFPLOADMACRO instruction) and run any SFPU example. The simulator will report UnimplementedFunctionality for the SFPLOADMACRO instruction. Silicon would execute it silently.

unset TT_METAL_DISABLE_SFPLOADMACRO
./build/programming_examples/metal_example_eltwise_sfpu 2>&1 | grep -i "unimplemented\|undefined\|error" | head -5
export TT_METAL_DISABLE_SFPLOADMACRO=1

23. Multicast to a core rectangle in one shot

The multicast example sends a value to every core in a rectangular range simultaneously. This is the mechanism behind weight broadcasting in large matrix multiplications — one sender, all receivers, a single NoC transaction.

export TT_METAL_DPRINT_CORES='(0,0)-(3,0)'
export TT_METAL_DPRINT_PREPEND_DEVICE_CORE_RISC=0
./build/programming_examples/contributed/multicast
Hello, Core (0, 0) on Device 0, please multicast the tile to your neighbors.
CORE (0,0): Tile ready for multicast. I am starting all inbound kernels in cores in given range.
CORE (1,0): Inbound kernel has received and acknowledged its tile.
CORE (2,0): Inbound kernel has received and acknowledged its tile.
CORE (3,0): Inbound kernel has received and acknowledged its tile.
Thank you, Core (0, 0) on Device 0, for the multicast.

=========== MULTICASTED TILE VERIFICATION ===========
[✅ PASS] Receiver tile 1 matches the golden tile.
[✅ PASS] Receiver tile 2 matches the golden tile.
[✅ PASS] Receiver tile 3 matches the golden tile.
[✅ PASS] All 3 receiver tiles match the golden tile.
=====================================================
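The '(0,0)-(3,0)' value passed to TT_METAL_DPRINT_CORES names a rectangle by its two corners. The sketch below expands that one form into individual core coordinates. It is a hypothetical illustration and does not model any other forms the variable may accept.

```python
import re


def expand_core_range(spec):
    """Expand a '(x0,y0)-(x1,y1)' rectangle into a list of (x, y) cores."""
    match = re.fullmatch(r"\((\d+),(\d+)\)-\((\d+),(\d+)\)", spec.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized core range: {spec!r}")
    x0, y0, x1, y1 = map(int, match.groups())
    return [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
```

For the range used above, the first core in the list is the sender at (0,0) and the remaining three are the receivers that appear in the verification block.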

24. Run the transformer attention kernel

This entry reruns matmul_multicore_reuse from entry 12 with attention in mind. Keeping weight tiles in L1 SRAM across multiple output tiles is the core optimization behind transformer attention layers: weights are loaded once and used many times across a grid of output cores.

./build/programming_examples/metal_example_matmul_multicore_reuse
Metalium vs Golden -- PCC = 0.99930096
Test Passed

25. Produce a bit-exact NaN and verify the bit pattern

ttsim guarantees bit-exact results for all operations, including the precise bit representation of NaN values. Divide bfloat16 zero by zero. Check the bit pattern against the ISA specification. If you have hardware available, compare the two — they match.

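import torch
import ttnn

device = ttnn.open_device(device_id=0)
zero = ttnn.from_torch(torch.zeros(32, 32, dtype=torch.bfloat16),
                       layout=ttnn.TILE_LAYOUT, device=device)
result = ttnn.div(zero, zero)
# Keep the result in bfloat16 so the raw 16-bit encoding survives the copy back
result_cpu = ttnn.to_torch(ttnn.from_device(result)).to(torch.bfloat16).contiguous()
actual = result_cpu[0, 0].float().item()
# Reinterpret the bfloat16 storage as 16-bit integers to read the bit pattern
bits = result_cpu.view(torch.int16)[0, 0].item() & 0xFFFF
print(f"Result: {actual}")
print(f"Is NaN: {actual != actual}")
print(f"bfloat16 bit pattern: 0x{bits:04X}")
ttnn.close_device(device)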
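Checking a bit pattern means reading its fields. The sketch below classifies a 16-bit bfloat16 pattern using the standard layout of 1 sign bit, 8 exponent bits, and 7 mantissa bits. It does not say which NaN payload the hardware or the simulator produces; that comparison is left to the ISA specification.

```python
def classify_bfloat16(bits):
    """Classify a 16-bit bfloat16 pattern by its exponent and mantissa fields.

    Layout (standard bfloat16): 1 sign bit, 8 exponent bits, 7 mantissa bits.
    """
    exponent = (bits >> 7) & 0xFF
    mantissa = bits & 0x7F
    if exponent == 0xFF:
        return "nan" if mantissa else "inf"
    if exponent == 0:
        return "zero" if mantissa == 0 else "subnormal"
    return "normal"
```

Any pattern with an all-ones exponent and a non-zero mantissa is a NaN, so many different bit patterns qualify. That is why a bit-exact match between simulator and silicon is a stronger statement than "both returned NaN".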

26. Measure kernel dispatch cost vs. kernel run cost

The profiler examples include test_custom_cycle_count_slow_dispatch, which uses software cycle instrumentation inside a kernel to measure how much time is spent dispatching versus executing.

Note: hardware performance counter values (cycle timers, performance monitors) are intentionally divergent on the simulator — the README states this explicitly. Software cycle counting inside kernels still works.

./build/programming_examples/profiler/test_custom_cycle_count_slow_dispatch
Test Passed

The ratio of dispatch overhead to execution time at this workload size tells you when a kernel is too small to schedule efficiently.
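The ratio itself is simple arithmetic once the two cycle counts are in hand. A minimal sketch, with a hypothetical function name; any numbers fed to it here are illustrative and are not measurements from the example.

```python
def dispatch_fraction(dispatch_cycles, execute_cycles):
    """Fraction of total cycles spent on dispatch rather than execution."""
    total = dispatch_cycles + execute_cycles
    if total <= 0:
        raise ValueError("cycle counts must sum to a positive number")
    return dispatch_cycles / total
```

A kernel with 900 dispatch cycles and 100 execution cycles spends 0.9 of its time being scheduled. Growing the work per dispatch is what pulls that fraction down.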


27. Simulate Blackhole on a machine that has never seen Blackhole

Switch to libttsim_bh.so. Your machine is now running kernels under a 140-core Blackhole SOC model. Use add_2_integers_in_riscv — compiled for WH, but the data-movement dispatch path works on both architectures.

cp $TT_METAL_HOME/tt_metal/soc_descriptors/blackhole_140_arch.yaml ~/sim/soc_descriptor.yaml
export TT_METAL_SIMULATOR=~/sim/libttsim_bh.so

./build/programming_examples/metal_example_add_2_integers_in_riscv

# Switch back
cp $TT_METAL_HOME/tt_metal/soc_descriptors/wormhole_b0_80_arch.yaml ~/sim/soc_descriptor.yaml
export TT_METAL_SIMULATOR=~/sim/libttsim_wh.so

To run compute-heavy examples on the BH simulator, build tt-metal targeting Blackhole (TT_METAL_ARCH_NAME=blackhole) and the compiled binaries will run under libttsim_bh.so. WH-compiled compute kernels fail on the BH simulator with named divergence errors — you can read exactly which ISA feature mismatch caused the fault, without touching a P-series card.


28. Find the race condition the simulator catches but silicon hides

Write a two-kernel program where the second kernel reads a buffer the first kernel writes, with no synchronization barrier between them. On silicon this probably passes. The hardware evaluates operations in a consistent order that happens to be correct for this workload, nearly every time. On the simulator, the README states: "ttsim may evaluate operations in any order permitted by software synchronization. This may include operation orders that are extremely unlikely on silicon."

# Deploy the demo script
mkdir -p ~/tt-scratchpad/ttsim
# Copy from the tt-vscode-toolkit checkout (adjust TOOLKIT_DIR to match yours):
# $HOME is used instead of ~ because a tilde inside double quotes is not expanded
TOOLKIT_DIR="${TOOLKIT_DIR:-$HOME/code/tt-vscode-toolkit}"
cp "$TOOLKIT_DIR/content/templates/ttsim/ttsim_race_demo.py" ~/tt-scratchpad/ttsim/

export TT_METAL_SIMULATOR=~/sim/libttsim_wh.so
python3 ~/tt-scratchpad/ttsim/ttsim_race_demo.py
With barrier:    CORRECT
Without barrier: CORRECT

NOTE: both paths ship with synchronization in place.
Exercise: in run_without_barrier(), remove the ttnn.from_device() call
on the line marked '# remove this line to race', then run again.
The simulator may produce 'WRONG (race detected)' where silicon would pass.

Follow the exercise in the script. The barrier was always necessary. The simulator shows you why.
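The README's point about ordering can be made concrete by enumerating every order the synchronization permits. The sketch below is a toy model with one write and one read, not a description of how ttsim schedules operations. The values 0 and 42 are arbitrary.

```python
from itertools import permutations


def possible_reads(with_barrier):
    """Return the set of values the reader kernel can observe.

    Model: the buffer starts at 0, the writer stores 42, the reader loads it.
    A barrier forces the write to happen before the read; without one, every
    ordering of the two operations is allowed.
    """
    observed = set()
    for order in permutations(("write", "read")):
        if with_barrier and order.index("write") > order.index("read"):
            continue  # ordering forbidden by the barrier
        buffer, seen = 0, None
        for op in order:
            if op == "write":
                buffer = 42
            else:
                seen = buffer
        observed.add(seen)
    return observed
```

With the barrier the reader can only see 42. Without it, the stale 0 is also a legal outcome, and a simulator that explores permitted orderings is free to produce it.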


29. Use the SFPU as a DSP core

The SFPU — 16-lane SIMD, bfloat16 and fp32, native transcendental functions — has the same computational structure as a DSP block in a custom silicon design. Implement a second-order IIR (biquad) filter in Python/TTNN, run it on a test signal, and characterize bfloat16 numerical error against a float64 reference.

mkdir -p ~/tt-scratchpad/ttsim
# Copy from the tt-vscode-toolkit checkout (adjust TOOLKIT_DIR to match yours):
# $HOME is used instead of ~ because a tilde inside double quotes is not expanded
TOOLKIT_DIR="${TOOLKIT_DIR:-$HOME/code/tt-vscode-toolkit}"
cp "$TOOLKIT_DIR/content/templates/ttsim/ttsim_biquad_kernel.py" ~/tt-scratchpad/ttsim/
python3 ~/tt-scratchpad/ttsim/ttsim_biquad_kernel.py
Biquad filter: 1024 samples
bfloat16 max error vs float64 reference: 0.0089
PASSED

The ISA documentation (tt-isa-documentation on GitHub) describes the full SFPU instruction encoding, register file, and opcode table. If you are designing a DSP chip or custom accelerator and want a verified functional model of a pipelined transcendental SIMD unit to drive your RTL requirements, this is one. You would not be taping out a Tensix core. You would be using a working functional model to characterize an algorithm before your RTL team writes a line of Verilog.
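The error characterization can be previewed on the host before touching the simulator. The sketch below is a CPU-only stand-in written for this lesson, not the toolkit's ttsim_biquad_kernel.py. The coefficients are hypothetical, round-to-nearest-even is assumed, and only each output sample is rounded to bfloat16, which is a simplification of real bfloat16 arithmetic.

```python
import math
import struct


def round_to_bfloat16(value):
    """Round to the nearest bfloat16 value (round-to-nearest-even assumed)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    (rounded,) = struct.unpack("<f", struct.pack("<I", bits))
    return rounded


def biquad(samples, b, a, quantize=lambda v: v):
    """Direct Form I biquad: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2]
    - a1*y[n-1] - a2*y[n-2], with each output passed through `quantize`."""
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = quantize(b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2)
        out.append(y0)
        x1, x2, y1, y2 = x0, x1, y0, y1
    return out


# Hypothetical coefficients (poles at radius 0.5, so the filter is stable).
B, A = (0.25, 0.5, 0.25), (-0.5, 0.25)
signal = [math.sin(2 * math.pi * n / 64) for n in range(1024)]
reference = biquad(signal, B, A)
quantized = biquad(signal, B, A, quantize=round_to_bfloat16)
max_error = max(abs(r - q) for r, q in zip(reference, quantized))
print(f"max error vs float64 reference: {max_error:.4f}")
```

Because the filter is recursive, each rounding error feeds back into later samples. A stable pole radius keeps that feedback bounded, which is why the choice of coefficients matters as much as the number format.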


30. Run a transformer layer through the simulator

A transformer attention layer requires Q/K/V projections (linear), scaled dot-product attention (batched matmul), softmax (SFPU chain), and output projection (linear). Every one of these is a confirmed working TTNN operation in slow dispatch mode. The following script implements one attention head — no model download, no HuggingFace token, no weight file.

▶ Run Transformer Attention on ttsim
export TT_METAL_SIMULATOR=~/sim/libttsim_wh.so
export TT_METAL_SLOW_DISPATCH_MODE=1
export TT_METAL_DISABLE_SFPLOADMACRO=1
python3 ~/tt-scratchpad/ttsim/ttsim_attention.py

Or manually:

mkdir -p ~/tt-scratchpad/ttsim
# $HOME is used instead of ~ because a tilde inside double quotes is not expanded
TOOLKIT_DIR="${TOOLKIT_DIR:-$HOME/code/tt-vscode-toolkit}"
cp "$TOOLKIT_DIR/content/templates/ttsim/ttsim_attention.py" ~/tt-scratchpad/ttsim/
python3 ~/tt-scratchpad/ttsim/ttsim_attention.py
Attention output shape: torch.Size([1, 32, 64])
PCC vs PyTorch reference: 0.999847
PASSED

The output is correct. Verified against the PyTorch reference. Running on a chip that does not exist in this machine.
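The four stages listed above fit in a few lines of plain Python, which makes a handy host-side reference for small shapes. This sketch is not the toolkit's ttsim_attention.py and uses no TTNN calls. The helper names are hypothetical and the matrices are ordinary lists of rows.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x, w_q, w_k, w_v, w_o):
    """Single attention head: projections, scaled scores, softmax, output."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(matmul(weights, v), w_o)
```

With identity matrices for all four weights and a one-hot input, the output rows are just the softmax weights, so each row sums to 1. That makes a quick sanity check before comparing against a device result.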


31. One more thing

v1.8.0 ships libttsim_wh_x2.so: a virtual N300 that gives you two Wormhole chips connected by simulated Ethernet. Open a MeshDevice(1, 2), shard a tensor across both chips with ShardTensorToMesh, run an op — it dispatches to both chips simultaneously. Same TTNN API you'd use on a real N300.

export TT_METAL_SIMULATOR=~/sim/libttsim_wh_x2.so
export TT_METAL_MOCK_CLUSTER_DESC_PATH=~/sim/n300_cluster_desc.yaml
export TT_METAL_SLOW_DISPATCH_MODE=1
export TT_METAL_DISABLE_SFPLOADMACRO=1

import torch, ttnn

# Open a 1×2 mesh — two virtual Wormhole chips (N300 topology)
mesh = ttnn.open_mesh_device(ttnn.MeshShape(1, 2))
print(mesh)  # MeshDevice(1x2 grid, 2 devices)

a = torch.randn(64, 64, dtype=torch.bfloat16)
b = torch.randn(64, 64, dtype=torch.bfloat16)

# Shard: top 32 rows → chip 0, bottom 32 rows → chip 1
a_mesh = ttnn.from_torch(a, layout=ttnn.TILE_LAYOUT, device=mesh,
                          mesh_mapper=ttnn.ShardTensorToMesh(mesh, dim=0))
b_mesh = ttnn.from_torch(b, layout=ttnn.TILE_LAYOUT, device=mesh,
                          mesh_mapper=ttnn.ShardTensorToMesh(mesh, dim=0))

# Dispatches to both chips in parallel
c_mesh = ttnn.add(a_mesh, b_mesh)

# Reconstruct full result
c = ttnn.to_torch(c_mesh, mesh_composer=ttnn.ConcatMeshToTensor(mesh, dim=0))
ttnn.close_mesh_device(mesh)

Or run the complete script:

mkdir -p ~/tt-scratchpad/ttsim
# $HOME is used instead of ~ because a tilde inside double quotes is not expanded
TOOLKIT_DIR="${TOOLKIT_DIR:-$HOME/code/tt-vscode-toolkit}"
cp "$TOOLKIT_DIR/content/templates/ttsim/ttsim_n300_mesh.py" ~/tt-scratchpad/ttsim/
python3 ~/tt-scratchpad/ttsim/ttsim_n300_mesh.py
Opened mesh: MeshDevice(1x2 grid, 2 devices)
Max error vs reference: 0.031250
✅ PASS — N300 mesh add (64x64, sharded across 2 chips)

Two devices. One simulation. The exact same API call you'd use on real N300 hardware — ShardTensorToMesh, ttnn.add, ConcatMeshToTensor. The multi-chip path works before you have the hardware.
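The shard, compute, concatenate pattern is easy to check on the host with no devices at all. The sketch below models it with plain lists. `shard_rows` and `add_on_mesh` are hypothetical stand-ins for illustration and do not call ShardTensorToMesh or ConcatMeshToTensor.

```python
def shard_rows(matrix, num_devices):
    """Split a matrix (list of rows) into equal row blocks, one per device."""
    if len(matrix) % num_devices:
        raise ValueError("row count must divide evenly across devices")
    rows = len(matrix) // num_devices
    return [matrix[d * rows:(d + 1) * rows] for d in range(num_devices)]


def add_on_mesh(a, b, num_devices=2):
    """Shard both inputs, add shard by shard, then concatenate the results."""
    partials = [
        [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(sa, sb)]
        for sa, sb in zip(shard_rows(a, num_devices), shard_rows(b, num_devices))
    ]
    return [row for shard in partials for row in shard]
```

Because an element-wise add never needs data from another shard, the concatenated result equals the unsharded add. Operations that do need cross-shard data are where the simulated Ethernet link comes into play.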

matmul_multicore_reuse on the simulator takes about five seconds. On a P300c it takes milliseconds. On a QuietBox with four P300cs, less than that.

Two things the simulator still cannot give you.

First: the performance counter values. Reads from hardware cycle counters and performance monitors return values the README explicitly marks as divergent. The simulator does not model real-time execution.

Second: fast dispatch. TT_METAL_SLOW_DISPATCH_MODE=1 is required. The fast dispatch path is not yet implemented. On hardware, turning off slow dispatch mode is the moment the architecture behaves differently. The dispatch overhead collapses. The ratio you measured in entry 26 changes by an order of magnitude.

There is a third thing, harder to describe. The biquad filter in entry 29 runs in the simulator. On silicon, with fast dispatch enabled, 1,024 samples of biquad filtering at bfloat16 precision completes in a time that has no analogue in software. The same arithmetic. The same bit patterns. A different physical reality.

The simulator gave you the model. Hardware gives you the thing.

To return to single-chip mode: export TT_METAL_SIMULATOR=~/sim/libttsim_wh.so and unset TT_METAL_MOCK_CLUSTER_DESC_PATH.


What You Learned

Ready for hardware? Start with verifying your installation to confirm your device is operational, then return here and run entries 30–31 again.