TT-Lang Debug Print Spec

Overview

Python print() inside TT-Lang operations lowers to device debug prints via a ttl.dprint op and a dedicated lowering pass. The pass has full compiler context (DST assignments, CB indices, thread types) and emits the appropriate tt-metal DPRINT calls.

Enabled at runtime by TT_METAL_DPRINT_CORES=0,0. Zero overhead when not set (tt-metal compiles DPRINT to dead code).

Note: prints can be extremely large and slow! It is recommended to save to a file that you can grep, modify your program to only run on small targetted inputs, and only place prints in targetted temporary places.

Note: prints in compute will overlap and cause confusing output unless guarded with thread=X.

Python API

Scalars

print("hello world")
print("x =", x, "y =", y)
print(42)
print(3.14)

Supported argument types: string constants, integer constants, float constants, integer variables (index, i32). Error on unsupported types.

Circular buffer details

with inp_dfb.wait() as tile:
    print(inp_dfb)

Prints CB metadata: size, limit, page_size, num_pages, rd_ptr, wr_ptr.

Tile from CB (full tile)

with inp_dfb.wait() as tile:
    print(tile, thread="pack")

Prints the full 32x32 tile contents from the CB. The tile must be live (between wait/pop or reserve/push).

Note: will dump all registers in a block if using multi-tile block size (cb shape > 1x1) as the print will be inside the loop generated.

Note: unsupported on math thread.

Example output:

0:(x=1,y=1):TR2: ======
0:(x=1,y=1):TR2: 0 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0:(x=1,y=1):TR2: 1 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
0:(x=1,y=1):TR2: 31 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0:(x=1,y=1):TR2: ++++++
0:(x=1,y=1):TR2: ======

Tensor pages

print(inp, num_pages=2)

Prints raw memory pages from a tensor’s backing buffer. Data format (bf16, f32) and L1 address derived from the tensor type and layout during lowering. num_pages defaults to 1.

DST register dump

result = ttl.exp(i)
print(_dump_dst_registers=True, label="after exp")

Dumps all DST register slots that are live at this program point. The pass resolves assigned DST indices and includes the label and producing op name for each slot.

Note: will dump all registers in a block if using multi-tile block size (cb shape > 1x1) as the print will be inside the loop generated.

Note: only supports bf16.

Thread conditioning

Any print can be conditioned on a specific thread:

print(tile, thread="math")
print(tile, thread="pack")
print(inp_dfb, thread="unpack")

When thread is specified, the print is wrapped in the corresponding DPRINT_MATH(...), DPRINT_PACK(...), or DPRINT_UNPACK(...) macro. In compute kernels, the thread is automatically selected based on the print mode when no explicit thread is given: scalar and DST prints use math, CB and tile prints use pack. Tensor page prints (num_pages=) are only supported in datamovement kernels. In datamovement kernels, no wrapping is applied when thread is omitted.

In depth + code gen

Scalar

print("x =", x)
ttmlir::dprint("x = ", v0, "\n");

Scalars can stay on the existing ttkernel.dprint -> ttmlir::dprint path. The lowering pass does not need to touch these.

CB details

print(inp_dfb)
print_cb_details(get_compile_time_arg_val(0));

Pass resolves the CB index from the cb_index attribute on the defining bind_cb (or the lowered ttkernel compile-time arg).

Tile (full)

with inp_dfb.wait() as tile:
    print(tile)
print_full_tile(get_compile_time_arg_val(0), 0, true);

Pass traces the tile value back to its CB via cb_wait/attach_cb, resolves the CB index and tile index within the block.

Tensor pages

print(inp, num_pages=2)
// bf16 tensor example
print_bf16_pages(get_read_ptr(get_compile_time_arg_val(0)), 1024, 2);

Pass derives the data format from the tensor element type (bf16 -> print_bf16_pages, f32 -> print_f32_pages) and the L1 address from the tensor accessor. num_pages comes from the op attribute.

Should only be used in datamovement kernels.

DST dump

result = ttl.exp(i)
print(_dump_dst_registers=True, label="after exp")
{
  DPRINT << "=== after exp ===" << ENDL();
  DPRINT << "DST[0] (ttl.exp):" << ENDL();
  dprint_tensix_dest_reg(0);
}

Pass walks backward from the op to find all tile values with DST slot assignments at this program point. For each live slot, emits dprint_tensix_dest_reg(slot) with a label identifying the producing op.

Thread conditioning

print(tile, thread="math")
DPRINT_MATH(
  print_full_tile(get_compile_time_arg_val(0), 0, true);
);

Wraps the entire emitted block in the specified thread macro. In compute kernels, the thread is auto-selected per mode when not specified (scalar/DST -> math, CB/tile -> pack).

Example: instrumented compute kernel

@ttl.compute()
def compute():
    with inp_dfb.wait() as i, out_dfb.reserve() as o:
        print("compute start")
        print(inp_dfb)
        result = ttl.exp(i)
        print(_dump_dst_registers=True, label="after exp")
        o.store(result)

Generated C++ (compute kernel):

void kernel_main() {
  // ...
  cb_wait_front(get_compile_time_arg_val(0), 1);
  cb_reserve_back(get_compile_time_arg_val(1), 1);

  ttmlir::dprint("compute start\n");
  print_cb_details(get_compile_time_arg_val(0));

  tile_regs_acquire();
  copy_tile_init(get_compile_time_arg_val(0));
  copy_tile(get_compile_time_arg_val(0), 0, 0);
  exp_tile_init();
  exp_tile(0);

  {
    DPRINT << "=== after exp ===" << ENDL();
    DPRINT << "DST[0] (ttl.exp):" << ENDL();
    dprint_tensix_dest_reg(0);
  }

  tile_regs_commit();
  tile_regs_wait();
  pack_tile<true>(0, get_compile_time_arg_val(1), 0);
  tile_regs_release();

  cb_pop_front(get_compile_time_arg_val(0), 1);
  cb_push_back(get_compile_time_arg_val(1), 1);
}