TT-Lang Debug Print Spec

Overview

Python print() inside TT-Lang operations lowers to device debug prints via a ttl.dprint op and a dedicated lowering pass. The pass has full compiler context (DST assignments, DFB indices, thread types) and emits the appropriate tt-metal DPRINT calls.

Enabled at runtime by setting TT_METAL_DPRINT_CORES, for example TT_METAL_DPRINT_CORES=0,0 to print from core (0,0). Zero overhead when not set (tt-metal compiles DPRINT to dead code).

Note: prints can be extremely large and slow. Save the output to a file that you can grep, run the program on small, targeted inputs, and place prints only in targeted, temporary locations.
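As a minimal sketch of the save-and-grep workflow above, the helper below runs a command with TT_METAL_DPRINT_CORES set, saves everything it prints to a log file, and then filters that file. The names run_with_dprint and grep_log are hypothetical, and the sketch assumes the device prints arrive on the host process's stdout or stderr.

```python
import os
import subprocess

def run_with_dprint(cmd, log_path, cores="0,0"):
    """Run `cmd` with device prints enabled and save its output to `log_path`."""
    env = dict(os.environ)
    env["TT_METAL_DPRINT_CORES"] = cores  # same variable as in the Overview
    with open(log_path, "w") as log:
        # Assumption: DPRINT output is written to the process's stdout/stderr.
        subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT, check=True)

def grep_log(log_path, needle):
    """Return the lines of the saved log that contain `needle`."""
    with open(log_path) as log:
        return [line.rstrip("\n") for line in log if needle in line]
```

For example, run_with_dprint(["python", "my_kernel_test.py"], "dprint.log") followed by grep_log("dprint.log", "TR2") would keep only the lines tagged TR2; my_kernel_test.py is a placeholder script name.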

Note: prints in compute will overlap and cause confusing output unless guarded with thread=X.

Python API

Scalars

print("hello world")
print("x =", x, "y =", y)
print(42)
print(3.14)

Supported argument types: string constants, integer constants, float constants, integer variables (index, i32). Any other argument type is an error.
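The type rule above can be modeled in plain Python as follows. This is only an illustrative sketch, not the compiler's implementation: Var is a hypothetical stand-in for a traced TT-Lang variable, classify_print_arg is a made-up name, and rejecting bool is an assumption the spec does not state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    """Hypothetical stand-in for a traced variable with a known element type."""
    name: str
    type: str  # e.g. "index", "i32", "f32"

def classify_print_arg(arg):
    """Return the supported kind of one print() argument, or raise TypeError."""
    if isinstance(arg, bool):
        # Assumption: bool is rejected even though it subclasses int in Python.
        raise TypeError("unsupported print argument type: bool")
    if isinstance(arg, str):
        return "string constant"
    if isinstance(arg, int):
        return "integer constant"
    if isinstance(arg, float):
        return "float constant"
    if isinstance(arg, Var) and arg.type in ("index", "i32"):
        return "integer variable"
    raise TypeError(f"unsupported print argument type: {arg!r}")
```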

Dataflow buffer details

with inp_dfb.wait() as tile:
    print(inp_dfb)

Prints DFB metadata: size, limit, page_size, num_pages, rd_ptr, wr_ptr.

Tile from DFB (full tile)

with inp_dfb.wait() as tile:
    print(tile, thread="pack")

Prints the full 32x32 tile contents from the DFB. The tile must be live (between wait/pop or reserve/push).

Note: with a multi-tile block size (DFB shape > 1x1), the print is emitted inside the generated loop, so it dumps all registers in the block.

Note: unsupported on math thread.

Example output:

(x=1,y=1):TR2: ======
(x=1,y=1):TR2: 0 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(x=1,y=1):TR2: 1 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
(x=1,y=1):TR2: 31 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
(x=1,y=1):TR2: ++++++
(x=1,y=1):TR2: ======
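Because tile dumps are long, it can help to parse a saved log back into rows for inspection. The sketch below assumes the line layout shown in the example output (a core prefix, a tag such as TR2, a row index, then the row values); parse_tile_rows is a hypothetical helper and the regular expression is inferred from that example only.

```python
import re

# Matches lines like "(x=1,y=1):TR2: 0 : 0 0 0 ..." from the example output.
ROW = re.compile(r"^\(x=(\d+),y=(\d+)\):(\w+): (\d+) : (.*)$")

def parse_tile_rows(lines):
    """Map row index -> list of float values; delimiter lines are skipped."""
    rows = {}
    for line in lines:
        match = ROW.match(line)
        if match is None:
            continue  # "======", "++++++", "..." and unrelated output
        row_index = int(match.group(4))
        rows[row_index] = [float(v) for v in match.group(5).split()]
    return rows
```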

Tensor pages

print(inp, num_pages=2)

Prints raw memory pages from a tensor’s backing buffer. The data format (bf16, f32) and L1 address are derived from the tensor type and layout during lowering. num_pages defaults to 1.

DST register dump

result = ttl.exp(i)
print(_dump_dst_registers=True, label="after exp")

Dumps all DST register slots that are live at this program point. The pass resolves assigned DST indices and includes the label and producing op name for each slot.

Note: with a multi-tile block size (DFB shape > 1x1), the print is emitted inside the generated loop, so it dumps all registers in the block.

Note: only supports bf16.

Thread conditioning

Any print can be conditioned on a specific thread:

print("x =", x, thread="math")
print(tile, thread="pack")
print(inp_dfb, thread="unpack")

When thread is specified, the print is wrapped in the corresponding DPRINT_MATH(...), DPRINT_PACK(...), or DPRINT_UNPACK(...) macro. In compute kernels, the thread is automatically selected based on the print mode when no explicit thread is given: scalar and DST prints use math, DFB and tile prints use pack. Tensor page prints (num_pages=) are only supported in datamovement kernels. In datamovement kernels, no wrapping is applied when thread is omitted.
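The selection rules in this section can be summarized as a small decision function. This is a sketch of the rules as stated here, not the pass itself; select_macro and the mode and kernel strings are hypothetical names chosen for illustration.

```python
MACROS = {"math": "DPRINT_MATH", "pack": "DPRINT_PACK", "unpack": "DPRINT_UNPACK"}

# Default thread per print mode in compute kernels, as described above.
COMPUTE_DEFAULTS = {"scalar": "math", "dst": "math", "dfb": "pack", "tile": "pack"}

def select_macro(mode, kernel, thread=None):
    """Return the wrapping macro name, or None when no wrapping is applied."""
    if mode == "tensor_pages" and kernel != "datamovement":
        raise ValueError("tensor page prints are only supported in datamovement kernels")
    if mode == "tile" and thread == "math":
        raise ValueError("tile prints are unsupported on the math thread")
    if thread is not None:
        return MACROS[thread]  # explicit thread always wins
    if kernel == "compute":
        return MACROS[COMPUTE_DEFAULTS[mode]]
    return None  # datamovement kernel, thread omitted
```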

In depth + code gen

Scalar

print("x =", x)

ttmlir::dprint("x = ", v0, "\n");

Scalars can stay on the existing ttkernel.dprint -> ttmlir::dprint path. The lowering pass does not need to touch these.

DFB details

print(inp_dfb)

print_cb_details(get_compile_time_arg_val(0));

Pass resolves the DFB index from the cb_index attribute on the defining bind_cb (or the lowered ttkernel compile-time arg).

Tile (full)

with inp_dfb.wait() as tile:
    print(tile)

print_full_tile(get_compile_time_arg_val(0), 0, true);

Pass traces the tile value back to its DFB via cb_wait/attach_cb, then resolves the DFB index and the tile index within the block.

Tensor pages

print(inp, num_pages=2)

// bf16 tensor example
print_bf16_pages(get_read_ptr(get_compile_time_arg_val(0)), 1024, 2);

Pass derives the data format from the tensor element type (bf16 -> print_bf16_pages, f32 -> print_f32_pages) and the L1 address from the tensor accessor. num_pages comes from the op attribute.
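The element-type dispatch described above can be sketched as a lookup that produces the C++ call as text. emit_page_print is a hypothetical helper, and reading the second argument (1024 in the example) as elements per page is an assumption about the printer signature.

```python
# Element type -> tt-metal page printer named in this spec.
PAGE_PRINTERS = {"bf16": "print_bf16_pages", "f32": "print_f32_pages"}

def emit_page_print(elem_type, addr_expr, elts_per_page, num_pages=1):
    """Build the C++ statement that prints `num_pages` pages at `addr_expr`."""
    if elem_type not in PAGE_PRINTERS:
        raise ValueError(f"no page printer for element type {elem_type!r}")
    printer = PAGE_PRINTERS[elem_type]
    # Assumption: the arguments are (L1 address, elements per page, page count).
    return f"{printer}({addr_expr}, {elts_per_page}, {num_pages});"
```

Calling emit_page_print("bf16", "get_read_ptr(get_compile_time_arg_val(0))", 1024, 2) reproduces the bf16 example line above.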

Tensor page prints are only supported in datamovement kernels.

DST dump

result = ttl.exp(i)
print(_dump_dst_registers=True, label="after exp")

{
  DPRINT << "=== after exp ===" << ENDL();
  DPRINT << "DST[0] (ttl.exp):" << ENDL();
  dprint_tensix_dest_reg(0);
}

Pass walks backward from the op to find all tile values with DST slot assignments at this program point. For each live slot, emits dprint_tensix_dest_reg(slot) with a label identifying the producing op.
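Once the live slots are known, the emission step is mechanical. The sketch below takes the live slots as an input list of (slot, producing op name) pairs and formats the block shown above; emit_dst_dump is a hypothetical name, and the backward walk that finds the live slots is not modeled.

```python
def emit_dst_dump(label, live_slots):
    """Format the C++ block that dumps each live DST slot with its producer."""
    lines = ["{", f'  DPRINT << "=== {label} ===" << ENDL();']
    for slot, op_name in sorted(live_slots):
        lines.append(f'  DPRINT << "DST[{slot}] ({op_name}):" << ENDL();')
        lines.append(f"  dprint_tensix_dest_reg({slot});")
    lines.append("}")
    return "\n".join(lines)
```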

Thread conditioning

print(tile, thread="pack")

DPRINT_PACK(
  print_full_tile(get_compile_time_arg_val(0), 0, true);
);

Wraps the entire emitted block in the specified thread macro. In compute kernels, the thread is auto-selected per mode when not specified (scalar/DST -> math, DFB/tile -> pack).
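The wrapping step itself can be pictured as a text transformation over the emitted statements. This is a sketch only; wrap_in_thread is a hypothetical helper, and the two-space indent simply mirrors the layout used in the examples in this document.

```python
THREADS = ("math", "pack", "unpack")

def wrap_in_thread(body_lines, thread=None):
    """Wrap emitted C++ statements in DPRINT_<THREAD>(...); or leave them bare."""
    if thread is None:
        return "\n".join(body_lines)  # no wrapping requested
    if thread not in THREADS:
        raise ValueError(f"unknown thread {thread!r}")
    indented = "\n".join("  " + line for line in body_lines)
    return f"DPRINT_{thread.upper()}(\n{indented}\n);"
```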

Example: instrumented compute kernel

@ttl.compute()
def compute():
    with inp_dfb.wait() as i, out_dfb.reserve() as o:
        print("compute start")
        print(inp_dfb)
        result = ttl.exp(i)
        print(_dump_dst_registers=True, label="after exp")
        o.store(result)

Generated C++ (compute kernel):

void kernel_main() {
  // ...
  cb_wait_front(get_compile_time_arg_val(0), 1);
  cb_reserve_back(get_compile_time_arg_val(1), 1);

  ttmlir::dprint("compute start\n");
  print_cb_details(get_compile_time_arg_val(0));

  tile_regs_acquire();
  copy_tile_init(get_compile_time_arg_val(0));
  copy_tile(get_compile_time_arg_val(0), 0, 0);
  exp_tile_init();
  exp_tile(0);

  {
    DPRINT << "=== after exp ===" << ENDL();
    DPRINT << "DST[0] (ttl.exp):" << ENDL();
    dprint_tensix_dest_reg(0);
  }

  tile_regs_commit();
  tile_regs_wait();
  pack_tile<true>(0, get_compile_time_arg_val(1), 0);
  tile_regs_release();

  cb_pop_front(get_compile_time_arg_val(0), 1);
  cb_push_back(get_compile_time_arg_val(1), 1);
}