Debug Checkpoints
Note
Tools are only fully supported on source builds.
Quick Start
# 1. Enable checkpoints + a print backend
export TT_METAL_CHECKPOINT=1
export TT_METAL_DPRINT_CORES=0,0
// 2. Add to EVERY kernel on the core (reader, writer, compute):
#include "api/debug/checkpoint.h"
void kernel_main() {
// ... work ...
DEBUG_CHECKPOINT("post_matmul"); // all RISCs sync → dump CB state → resume
// ... more work ...
}
# 3. Run your program — CB state is printed at each checkpoint
For standalone dumps (no barrier needed):
#include "api/debug/dump.h"
debug_dump_cb(0, 8); // CB0 metadata + 8 hex words
debug_dump_l1(0x100000, 16); // 16 words at L1 address
Overview
Debug checkpoints provide synchronized inspection points for fused kernels. When a checkpoint is hit, all active RISCs halt together, dump circular buffer state via DPRINT, then proceed in unison. Two levels are available:
- Single-core (DEBUG_CHECKPOINT): synchronizes all 5 RISCs on one core.
- Global (DEBUG_CHECKPOINT_GLOBAL): synchronizes all RISCs on all tensix cores.
This generalizes dprint_tensix_dest_regs, which synchronizes only the three TRISCs and
dumps only destination registers.
Enabling
Checkpoints (synchronized barriers) are enabled with:
export TT_METAL_CHECKPOINT=1
Without a print backend, checkpoints act as barriers only (no dump output). To get CB dump output, also enable DPRINT:
export TT_METAL_CHECKPOINT=1
export TT_METAL_DPRINT_CORES=0,0
Standalone dump utilities (debug_dump_cb, debug_dump_l1, etc.) require only DPRINT —
no TT_METAL_CHECKPOINT needed:
export TT_METAL_DPRINT_CORES=0,0
When neither is set, all dump functions and checkpoint macros are no-ops with zero overhead.
Note
TT_METAL_CHECKPOINT is read at JIT compile time. If you toggle it, clear the kernel
cache to force recompilation: rm -rf ~/.cache/tt-metal-cache
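The four combinations of the two variables can be summarized in a small host-side sketch. The enum and function names below are illustrative and not part of the tt-metal API, and treating any non-empty value as "enabled" is an assumption rather than documented behavior.

```cpp
#include <cassert>
#include <cstdlib>

// Illustrative names only; none of this is part of the tt-metal API.
enum class DebugMode { Disabled, BarrierOnly, BarrierAndDump, DumpOnly };

// Maps the two switches described above onto the resulting behavior.
DebugMode classify(bool checkpoint_enabled, bool dprint_enabled) {
    if (checkpoint_enabled) {
        // Checkpoints always act as barriers; DPRINT adds the CB dump output.
        return dprint_enabled ? DebugMode::BarrierAndDump : DebugMode::BarrierOnly;
    }
    // Without TT_METAL_CHECKPOINT only the standalone dump utilities can print.
    return dprint_enabled ? DebugMode::DumpOnly : DebugMode::Disabled;
}

// Assumption: a variable counts as enabled when it is set to a non-empty value.
bool env_set(const char* name) {
    const char* value = std::getenv(name);
    return value != nullptr && value[0] != '\0';
}

DebugMode current_mode() {
    return classify(env_set("TT_METAL_CHECKPOINT"), env_set("TT_METAL_DPRINT_CORES"));
}
```

Calling current_mode() from a test harness before launching a program is one way to confirm that the environment matches the output you expect to see.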
Single-Core Checkpoints
Usage
Include api/debug/checkpoint.h in every kernel (reader, writer, and compute) that
participates in the checkpoint. All active RISCs must call DEBUG_CHECKPOINT with the same name
at the corresponding point within the op.
Compute kernel:
#include "api/debug/checkpoint.h"
void kernel_main() {
// ... stage 1: unpack and compute ...
DEBUG_CHECKPOINT("post_unpack"); // all RISCs synchronize and dump CB state
// ... stage 2: pack output ...
}
Reader kernel (NCRISC):
#include "api/debug/checkpoint.h"
void kernel_main() {
// ... read tiles from DRAM into input CB ...
DEBUG_CHECKPOINT("post_unpack"); // must match the compute kernel's checkpoint name
}
Writer kernel (BRISC):
#include "api/debug/checkpoint.h"
void kernel_main() {
DEBUG_CHECKPOINT("post_unpack"); // synchronize before consuming output CB
// ... write tiles from output CB to DRAM ...
}
Every active RISC must call the checkpoint. If a RISC is active but does not call
DEBUG_CHECKPOINT, the barrier will hang.
Knobs
DEBUG_CHECKPOINT_EX provides compile-time knobs to control what gets dumped:
DEBUG_CHECKPOINT_EX(name, num_cbs, words_per_cb, dump_dest)
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | | (required) | Checkpoint name (string literal). All RISCs must use the same value. |
| num_cbs | | 0 | Number of CBs to dump. 0 means all configured CBs. |
| words_per_cb | | 0 | Number of uint32 words of L1 data to hex-dump per CB. 0 means metadata only. |
| dump_dest | | false | If true, TRISC1 (Math) dumps destination register contents instead of skipping. |
Examples:
// Dump metadata for all configured CBs
DEBUG_CHECKPOINT("post_matmul");
// Dump first 4 CBs, 8 words of L1 data each, plus dest registers
DEBUG_CHECKPOINT_EX("pre_pack", 4, 8, true);
How it works
L1 state. Before each kernel launch, BRISC writes a 20-byte checkpoint struct to
MEM_LLK_DEBUG_BASE in L1 (shared by all RISCs on the core):
participant_mask = 0x1F // bits 0-4: BRISC, NCRISC, TRISC0, TRISC1, TRISC2
proceed = 0 // monotonically increasing epoch counter
arrived[0..4] = 0 // per-RISC arrival flag (one byte each)
orchestrator_idx = 0 // lowest active RISC (BRISC)
Entry barrier. Each RISC hits debug_checkpoint_barrier() at its own pace:
1. Each RISC reads the current proceed epoch (0) and computes next_epoch = 1.
2. Each RISC writes arrived[my_idx] = 1: its own byte, no contention with other RISCs.
3. The orchestrator (lowest active RISC, typically BRISC) polls all arrived[] bytes, spinning with invalidate_l1_cache() until all match next_epoch.
4. All other RISCs spin on proceed, waiting for it to reach next_epoch (uses a >= comparison to handle the case where the orchestrator has already advanced to the next barrier).
5. Once the orchestrator sees all arrivals, it sets proceed = 1, releasing everyone.
At this point all RISCs are synchronized — no RISC proceeds until every active RISC has arrived.
Dump. CB pointers (rd_ptr, wr_ptr, tiles_acked, tiles_received) are
RISC-specific, so each CB-capable RISC prints its own view:
- BRISC, NCRISC, TRISC0 (Unpack), TRISC2 (Pack) each print CB metadata prefixed with their RISC index (e.g., RISC0, RISC2). Each sees different pointer values. Optionally hex-dumps L1 data at the read pointer.
- TRISC1 (Math) prints destination register contents if dump_dest=true (only Math can access dest regs). Otherwise it prints nothing.
Exit barrier. Same mechanism with next_epoch = 2. This ensures no RISC moves past the
checkpoint until every RISC has finished printing — without it, a fast RISC could modify CBs
before a slow RISC finishes reading them.
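The entry and exit barriers can be modeled on a host machine with threads standing in for RISCs. This is a minimal sketch under stated assumptions: std::atomic replaces the L1 struct and the invalidate_l1_cache() polling, the struct layout is simplified, and the names CheckpointModel, barrier and run_model are hypothetical. It is not the kernel-side implementation.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int kNumRiscs = 5;      // BRISC, NCRISC, TRISC0, TRISC1, TRISC2
constexpr int kOrchestrator = 0;  // lowest active RISC

// Host-side stand-in for the L1 checkpoint struct; field names follow the text.
struct CheckpointModel {
    std::atomic<uint32_t> proceed{0};
    std::array<std::atomic<uint8_t>, kNumRiscs> arrived{};
};

void barrier(CheckpointModel& s, int my_idx) {
    const uint32_t next_epoch = s.proceed.load() + 1;
    const uint8_t flag = static_cast<uint8_t>(next_epoch);
    s.arrived[my_idx].store(flag);  // each RISC writes only its own byte
    if (my_idx == kOrchestrator) {
        // Orchestrator: wait until every arrival byte matches this epoch.
        for (int i = 0; i < kNumRiscs; ++i) {
            while (s.arrived[i].load() != flag) std::this_thread::yield();
        }
        s.proceed.store(next_epoch);  // release everyone
    } else {
        // Everyone else: wait for the epoch to reach the expected value.
        while (s.proceed.load() < next_epoch) std::this_thread::yield();
    }
}

// Returns how many threads observed all five arrivals after the entry barrier.
int run_model() {
    CheckpointModel s;
    std::atomic<int> arrivals{0};
    std::atomic<int> saw_everyone{0};
    std::vector<std::thread> riscs;
    for (int idx = 0; idx < kNumRiscs; ++idx) {
        riscs.emplace_back([&s, &arrivals, &saw_everyone, idx] {
            arrivals.fetch_add(1);  // "work" done before the checkpoint
            barrier(s, idx);        // entry barrier (epoch 1)
            if (arrivals.load() == kNumRiscs) saw_everyone.fetch_add(1);
            barrier(s, idx);        // exit barrier (epoch 2)
        });
    }
    for (auto& t : riscs) t.join();
    return saw_everyone.load();
}
```

In this model run_model() should return 5, because no thread can pass the entry barrier until all five have incremented the arrival counter.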
Why per-byte flags (not a shared bitmask). The original design used
arrived_mask |= (1 << my_idx) — a read-modify-write on a shared uint32_t. If two RISCs
read the same stale value from L1 cache, they overwrite each other’s bit. Per-byte flags avoid
this: each RISC writes only arrived[my_idx], a distinct byte. The orchestrator reads all
bytes but never writes to another RISC’s byte.
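The lost update can be reproduced deterministically on the host by replaying the stale-read interleaving by hand. The sketch below is illustrative: the function names are hypothetical and plain variables stand in for L1 and the per-RISC cached copies.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Shared-bitmask variant: both RISCs start from the same stale cached value.
uint32_t bitmask_with_stale_reads(int idx_a, int idx_b) {
    uint32_t l1_mask = 0;
    const uint32_t stale_a = l1_mask;   // RISC A reads 0
    const uint32_t stale_b = l1_mask;   // RISC B reads 0 before A writes back
    l1_mask = stale_a | (1u << idx_a);  // A writes its bit
    l1_mask = stale_b | (1u << idx_b);  // B overwrites the word, dropping A's bit
    return l1_mask;
}

// Per-byte variant: each RISC stores only to its own slot, so nothing is lost.
std::array<uint8_t, 5> per_byte_flags(int idx_a, int idx_b) {
    std::array<uint8_t, 5> arrived{};
    arrived[idx_a] = 1;
    arrived[idx_b] = 1;
    return arrived;
}
```

With indices 0 and 1, the bitmask variant ends with only bit 1 set, while the per-byte variant records both arrivals.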
Why an epoch counter (not a simple flag). If we used a 0/1 flag, the exit barrier would see
proceed already at 1 (from the entry barrier) and skip the wait. The monotonically increasing
epoch (0 → 1 → 2 → …) ensures each barrier waits for a unique value.
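The difference between the two designs comes down to which value each barrier waits for. A minimal sketch, with hypothetical helper names and barriers numbered from 1:

```cpp
#include <cassert>
#include <cstdint>

// A waiter is released once `proceed` has reached its target value.
bool released(uint32_t proceed, uint32_t target) { return proceed >= target; }

// 0/1 flag design: every barrier waits for the same value, 1.
uint32_t flag_target(uint32_t /*barrier_number*/) { return 1; }

// Epoch design: barrier k waits for the value k.
uint32_t epoch_target(uint32_t barrier_number) { return barrier_number; }
```

After the entry barrier proceed is 1, so released(1, flag_target(2)) is already true and the exit barrier would be skipped, whereas released(1, epoch_target(2)) is false and the exit barrier still waits.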
Global Checkpoints (Cross-Core)
DEBUG_CHECKPOINT_GLOBAL extends checkpoints to synchronize all RISCs on all tensix cores.
This is needed when a fused kernel spans multiple cores and you want a consistent snapshot of CB
state across the entire grid.
Usage
DEBUG_CHECKPOINT_GLOBAL(name, sem_id, barrier_coord_x, barrier_coord_y, num_cores)
| Parameter | Description |
|---|---|
| name | Checkpoint name (string literal, for output labeling) |
| sem_id | Semaphore ID allocated by the host (see Host setup below) |
| barrier_coord_x, barrier_coord_y | Physical NOC coordinates of the coordinator core for the cross-core barrier. These only affect synchronization, not what gets printed. |
| num_cores | Total number of cores participating |
Host setup:
// Allocate semaphore on all participating cores
CoreRange cores({0, 0}, {0, 1}); // 2 cores
uint32_t sem_id = CreateSemaphore(program, cores, 0);
// Get coordinator's physical NOC coordinates (for the barrier, not for printing)
CoreCoord barrier_coord = device->worker_core_from_logical_core({0, 0});
// Pass to all kernels as runtime args
SetRuntimeArgs(program, kernel, core, {
...,
sem_id, barrier_coord.x, barrier_coord.y, num_cores
});
Kernel usage (all RISCs on all cores must call with the same args):
#include "api/debug/checkpoint.h"
void kernel_main() {
uint32_t sem_id = get_arg_val<uint32_t>(3);
uint32_t barrier_coord_x = get_arg_val<uint32_t>(4);
uint32_t barrier_coord_y = get_arg_val<uint32_t>(5);
uint32_t num_cores = get_arg_val<uint32_t>(6);
// ... work ...
// All cores synchronize here, then each core prints its OWN local CB state
DEBUG_CHECKPOINT_GLOBAL("global_sync", sem_id, barrier_coord_x, barrier_coord_y, num_cores);
}
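The kernel above reads the four checkpoint values at runtime-arg indices 3 to 6, which implies that three op-specific arguments precede them in this example. One way to keep host and kernel indices in agreement is to append the checkpoint values through a single helper and pass the returned base index along. The helper below is hypothetical and not part of tt-metal; it only builds the argument vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: appends the four global-checkpoint values to an existing
// runtime-arg list and returns the index at which sem_id was placed.
std::size_t append_checkpoint_args(std::vector<uint32_t>& args, uint32_t sem_id,
                                   uint32_t barrier_x, uint32_t barrier_y,
                                   uint32_t num_cores) {
    const std::size_t base = args.size();
    // Order must match what the kernel reads: sem_id, x, y, num_cores.
    args.insert(args.end(), {sem_id, barrier_x, barrier_y, num_cores});
    return base;
}
```

With three op arguments already in the vector the helper returns 3, matching get_arg_val<uint32_t>(3) for sem_id in the kernel example.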
How it works
The global checkpoint inserts a cross-core NOC semaphore barrier between the intra-core barriers described above:
DEBUG_CHECKPOINT_GLOBAL:
┌─ intra-core barrier ──── all 5 RISCs on THIS core sync ─────┐
│ ┌─ cross-core barrier ── BRISC on ALL cores sync ────────┐ │
│ │ (NOC semaphore) │ │
│ └────────────────────────────────────────────────────────┘ │
├─ intra-core barrier ──── BRISC releases other RISCs ────────┤
│ │
│ DUMP: each CB-capable RISC prints its CB view │
│ │
├─ intra-core barrier ──── all 5 RISCs finish dumping ────────┤
│ ┌─ cross-core barrier ── BRISC on ALL cores sync ────────┐ │
│ └────────────────────────────────────────────────────────┘ │
└─ intra-core barrier ──── final release ─────────────────────┘
The single-core DEBUG_CHECKPOINT is the same structure but without the cross-core
barrier steps. Only BRISC participates in the NOC cross-core operations — TRISC and NCRISC
threads wait via the intra-core barriers that bracket the cross-core phase.
Step-by-step (example with 2 cores):
1. Intra-core barrier. Each core runs the single-core barrier independently (per-byte arrived[] flags + epoch counter at MEM_LLK_DEBUG_BASE). After this, all 5 RISCs on each core are halted together, but the two cores are NOT yet synchronized with each other.
2. Cross-core barrier (BRISC only). Only BRISC on each core participates. The other 4 RISCs are blocked at the next intra-core barrier (step 3), waiting for BRISC.
   - Both BRISCs atomically increment the coordinator’s semaphore via NOC: noc_semaphore_inc(coordinator_sem_addr, 1). Both target the same physical L1 address on the coordinator core. noc_semaphore_inc is a hardware atomic.
   - The semaphore is monotonic across global checkpoints: it is never reset to 0 between barriers. Each barrier instance waits for its own expected count (the previous barrier’s target + num_cores), so barrier N waits for N * num_cores. This avoids the race where a reset could lose increments from a subsequent barrier.
   - Coordinator BRISC: its local semaphore IS the one being incremented. It calls noc_semaphore_wait_min(local_sem, expected_count), a local spin with no NOC reads.
   - Non-coordinator BRISC: polls the coordinator’s semaphore via noc_async_read into its own local semaphore copy, checking until the value reaches expected_count.

   barrier_coord must be the same for all global checkpoints in a program.
3. Intra-core barrier. BRISC has returned from the cross-core barrier. The other 4 RISCs were spinning here. BRISC’s arrival advances the epoch, releasing them. All 10 RISCs (5 per core × 2 cores) are now synchronized.
4. Dump. Each CB-capable RISC prints its own view of CB metadata (pointers are RISC-specific). If dump_dest=true, TRISC1 also prints dest registers.
5. Intra-core barrier. Ensures all RISCs finish printing before any proceeds.
6. Cross-core barrier. Same mechanism as step 2, with expected_count advanced to 2 * num_cores. Ensures all cores finish before any proceeds.
7. Final intra-core barrier. Releases all RISCs after the cross-core exit barrier.
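The monotonic semaphore accounting from step 2 can be checked with a small single-threaded model. This is a sketch under simplifying assumptions: one struct tracks both the coordinator's semaphore and the expected count (on hardware each BRISC keeps its own expected count), and the name GlobalBarrierModel and its methods are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Model of the coordinator's semaphore: it only ever counts up.
struct GlobalBarrierModel {
    uint32_t num_cores;
    uint32_t semaphore = 0;       // value in the coordinator's L1
    uint32_t expected_count = 0;  // target of the most recent barrier

    // Called once per cross-core barrier to compute that barrier's target.
    uint32_t begin_barrier() {
        expected_count += num_cores;
        return expected_count;
    }

    // One core's BRISC arriving (stands in for the NOC atomic increment).
    void arrive() { ++semaphore; }

    // Waiters proceed once the semaphore has reached the current target.
    bool released() const { return semaphore >= expected_count; }
};
```

With two cores the first barrier targets 2 and the second targets 4, so increments that arrive early for the second barrier are counted toward it instead of being wiped by a reset.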
Output Format
Each CB-capable RISC prints a header with the checkpoint name and its RISC index:
=== CKPT post_matmul RISC0 CBs ===
CB0 sz=128 rd=1024 wr=1152 ack=0 rcv=1
CB16 sz=128 rd=2048 wr=2048 ack=0 rcv=0
=== CKPT post_matmul RISC2 CBs ===
CB0 sz=128 rd=1024 wr=1024 ack=0 rcv=1
When dump_dest=true, TRISC1 (Math) also prints destination register contents:
=== CKPT pre_pack dest regs ===
...
When words_per_cb > 0, L1 data at the read pointer is printed in hex:
CB0 sz=128 rd=1024 wr=1152 ack=0 rcv=1
[0] 3f800000 40000000 40400000 40800000
[4] 40a00000 40c00000 40e00000 41000000
The CB metadata fields are:
sz: FIFO size (in address-shifted units)
rd: read pointer
wr: write pointer
ack: tiles acked (consumed)
rcv: tiles received (produced)
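When comparing checkpoint output across RISCs or runs, it can help to parse the metadata lines on the host. The sketch below assumes the line layout shown in the sample output above with decimal field values; the CbLine struct and parse_cb_line function are hypothetical and not shipped with tt-metal.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Fields of one "CB<id> sz=... rd=... wr=... ack=... rcv=..." line.
struct CbLine {
    unsigned cb_id, sz, rd, wr, ack, rcv;
};

// Returns true and fills `out` when the line matches the sample layout.
bool parse_cb_line(const std::string& line, CbLine& out) {
    CbLine parsed{};
    // The leading space in the format skips any indentation before "CB".
    const int matched = std::sscanf(line.c_str(), " CB%u sz=%u rd=%u wr=%u ack=%u rcv=%u",
                                    &parsed.cb_id, &parsed.sz, &parsed.rd,
                                    &parsed.wr, &parsed.ack, &parsed.rcv);
    if (matched != 6) return false;  // header lines and hex rows are rejected
    out = parsed;
    return true;
}
```

A script built on this could, for example, flag CBs where rd and wr differ between the RISC0 and RISC2 views at the same checkpoint.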
Standalone Dump Utilities
In addition to the full checkpoint barrier, standalone dump functions are available for quick
inspection of individual CBs or arbitrary L1 memory. These can be called from any kernel at any
point — no barrier or synchronization required. Include api/debug/dump.h.
debug_dump_cb(cb_id, num_words)
Prints CB metadata and optionally raw hex data starting at the read pointer. Available on BRISC, NCRISC, TRISC0, and TRISC2 (not TRISC1/Math, which cannot access CB interfaces).
#include "api/debug/dump.h"
debug_dump_cb(0); // CB0 metadata only
debug_dump_cb(0, 8); // CB0 metadata + 8 hex words from read pointer
debug_dump_cb(16, 4); // CB16 metadata + 4 hex words
Output:
CB0 sz=128 rd=1024 wr=1152 ack=0 rcv=1
[0] 3f800000 40000000 40400000 40800000
[4] 40a00000 40c00000 40e00000 41000000
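If the CB holds float32 data, which is an assumption about the sample above and not something the dump itself states, each hex word can be decoded on the host as an IEEE-754 single-precision value. The helper names below are illustrative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Reinterprets one dumped 32-bit word as an IEEE-754 single-precision value.
float word_to_float(uint32_t word) {
    float value;
    std::memcpy(&value, &word, sizeof(value));  // bit copy, no numeric conversion
    return value;
}

// Parses a hex token such as "3f800000" as printed in the dump.
uint32_t parse_hex_word(const std::string& token) {
    return static_cast<uint32_t>(std::stoul(token, nullptr, 16));
}
```

Under that assumption the eight sample words decode to 1.0 through 8.0. For other data formats the raw words need format-specific decoding.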
debug_dump_cb_typed(cb_id, tile_idx)
Prints tile data interpreted according to the CB’s data format, showing actual float/int values. Uses TileSlice internally.
Available on TRISC0 (Unpack), TRISC2 (Pack), BRISC, and NCRISC. On BRISC/NCRISC, an additional
cb_type parameter specifies whether the CB is an input or output.
// On TRISC0 (Unpack) or TRISC2 (Pack):
debug_dump_cb_typed(0, 0); // CB0, tile 0, untilized
// On BRISC or NCRISC (need to specify input vs output):
debug_dump_cb_typed(0, 0, TSLICE_INPUT_CB); // CB0 as input CB
debug_dump_cb_typed(16, 0, TSLICE_OUTPUT_CB); // CB16 as output CB
debug_dump_l1(addr, num_words)
Hex-dumps arbitrary L1 memory. Available from all RISCs. Useful for inspecting semaphores, scratch space, or any L1 region.
debug_dump_l1(0x100000, 16); // 16 words starting at L1 address 0x100000
Output:
L1[0x100000] 16 words:
[0x100000] 3f800000 40000000 40400000 40800000
[0x100010] 40a00000 40c00000 40e00000 41000000
[0x100020] 41100000 41200000 41300000 41400000
[0x100030] 41500000 41600000 41700000 41800000
All three functions are no-ops when DPRINT is not enabled.
Comparison with dprint_tensix_dest_regs
| Feature | dprint_tensix_dest_regs | DEBUG_CHECKPOINT |
|---|---|---|
| RISCs synchronized | TRISC0, TRISC1, TRISC2 only | All active RISCs (BRISC, NCRISC, TRISC0/1/2) |
| What is dumped | Destination register contents | CB metadata per RISC (+ optional dest regs via Math, + optional L1 data) |
| Callable from | Compute kernels only | Any kernel, but all active RISCs must participate |
| BRISC/NCRISC involvement | None | Full participation; each CB-capable RISC prints its own CB view |
Use dprint_tensix_dest_regs when you only need to inspect compute output in dest registers.
Use DEBUG_CHECKPOINT when you need a consistent snapshot of the entire dataflow + compute
pipeline at a specific point within a large op.
Files
| File | Purpose |
|---|---|
| api/debug/checkpoint.h | Checkpoint API: single-core barrier, cross-core barrier, CB dump, dest reg dump |
| api/debug/dump.h | Standalone dump utilities |
| | |
| | Calls |