# TT-Lang Debug Print Spec ## Overview Python `print()` inside TT-Lang operations lowers to device debug prints via a `ttl.dprint` op and a dedicated lowering pass. The pass has full compiler context (DST assignments, CB indices, thread types) and emits the appropriate tt-metal DPRINT calls. Enabled at runtime by `TT_METAL_DPRINT_CORES=0,0`. Zero overhead when not set (tt-metal compiles DPRINT to dead code). Note: prints can be extremely large and slow! It is recommended to save to a file that you can grep, modify your program to only run on small targetted inputs, and only place prints in targetted temporary places. Note: prints in compute will overlap and cause confusing output unless guarded with thread=X. ## Python API ### Scalars ```python print("hello world") print("x =", x, "y =", y) print(42) print(3.14) ``` Supported argument types: string constants, integer constants, float constants, integer variables (index, i32). Error on unsupported types. ### Circular buffer details ```python with inp_dfb.wait() as tile: print(inp_dfb) ``` Prints CB metadata: size, limit, page_size, num_pages, rd_ptr, wr_ptr. ### Tile from CB (full tile) ```python with inp_dfb.wait() as tile: print(tile, thread="pack") ``` Prints the full 32x32 tile contents from the CB. The tile must be live (between wait/pop or reserve/push). Note: will dump all registers in a block if using multi-tile block size (cb shape > 1x1) as the print will be inside the loop generated. Note: unsupported on math thread. Example output: ``` 0:(x=1,y=1):TR2: ====== 0:(x=1,y=1):TR2: 0 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0:(x=1,y=1):TR2: 1 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 0:(x=1,y=1):TR2: 31 : 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0:(x=1,y=1):TR2: ++++++ 0:(x=1,y=1):TR2: ====== ``` ### Tensor pages ```python print(inp, num_pages=2) ``` Prints raw memory pages from a tensor's backing buffer. Data format (bf16, f32) and L1 address derived from the tensor type and layout during lowering. `num_pages` defaults to 1. ### DST register dump ```python result = ttl.exp(i) print(_dump_dst_registers=True, label="after exp") ``` Dumps all DST register slots that are live at this program point. The pass resolves assigned DST indices and includes the label and producing op name for each slot. Note: will dump all registers in a block if using multi-tile block size (cb shape > 1x1) as the print will be inside the loop generated. Note: only supports bf16. ### Thread conditioning Any print can be conditioned on a specific thread: ```python print(tile, thread="math") print(tile, thread="pack") print(inp_dfb, thread="unpack") ``` When `thread` is specified, the print is wrapped in the corresponding `DPRINT_MATH(...)`, `DPRINT_PACK(...)`, or `DPRINT_UNPACK(...)` macro. In compute kernels, the thread is automatically selected based on the print mode when no explicit `thread` is given: scalar and DST prints use `math`, CB and tile prints use `pack`. Tensor page prints (`num_pages=`) are only supported in datamovement kernels. In datamovement kernels, no wrapping is applied when `thread` is omitted. ## In depth + code gen ### Scalar ```python print("x =", x) ``` ```cpp ttmlir::dprint("x = ", v0, "\n"); ``` Scalars can stay on the existing `ttkernel.dprint` -> `ttmlir::dprint` path. The lowering pass does not need to touch these. ### CB details ```python print(inp_dfb) ``` ```cpp print_cb_details(get_compile_time_arg_val(0)); ``` Pass resolves the CB index from the `cb_index` attribute on the defining `bind_cb` (or the lowered ttkernel compile-time arg). ### Tile (full) ```python with inp_dfb.wait() as tile: print(tile) ``` ```cpp print_full_tile(get_compile_time_arg_val(0), 0, true); ``` Pass traces the tile value back to its CB via `cb_wait`/`attach_cb`, resolves the CB index and tile index within the block. ### Tensor pages ```python print(inp, num_pages=2) ``` ```cpp // bf16 tensor example print_bf16_pages(get_read_ptr(get_compile_time_arg_val(0)), 1024, 2); ``` Pass derives the data format from the tensor element type (`bf16` -> `print_bf16_pages`, `f32` -> `print_f32_pages`) and the L1 address from the tensor accessor. `num_pages` comes from the op attribute. Should only be used in datamovement kernels. ### DST dump ```python result = ttl.exp(i) print(_dump_dst_registers=True, label="after exp") ``` ```cpp { DPRINT << "=== after exp ===" << ENDL(); DPRINT << "DST[0] (ttl.exp):" << ENDL(); dprint_tensix_dest_reg(0); } ``` Pass walks backward from the op to find all tile values with DST slot assignments at this program point. For each live slot, emits `dprint_tensix_dest_reg(slot)` with a label identifying the producing op. ### Thread conditioning ```python print(tile, thread="math") ``` ```cpp DPRINT_MATH( print_full_tile(get_compile_time_arg_val(0), 0, true); ); ``` Wraps the entire emitted block in the specified thread macro. In compute kernels, the thread is auto-selected per mode when not specified (scalar/DST -> math, CB/tile -> pack). ## Example: instrumented compute kernel ```python @ttl.compute() def compute(): with inp_dfb.wait() as i, out_dfb.reserve() as o: print("compute start") print(inp_dfb) result = ttl.exp(i) print(_dump_dst_registers=True, label="after exp") o.store(result) ``` Generated C++ (compute kernel): ```cpp void kernel_main() { // ... cb_wait_front(get_compile_time_arg_val(0), 1); cb_reserve_back(get_compile_time_arg_val(1), 1); ttmlir::dprint("compute start\n"); print_cb_details(get_compile_time_arg_val(0)); tile_regs_acquire(); copy_tile_init(get_compile_time_arg_val(0)); copy_tile(get_compile_time_arg_val(0), 0, 0); exp_tile_init(); exp_tile(0); { DPRINT << "=== after exp ===" << ENDL(); DPRINT << "DST[0] (ttl.exp):" << ENDL(); dprint_tensix_dest_reg(0); } tile_regs_commit(); tile_regs_wait(); pack_tile(0, get_compile_time_arg_val(1), 0); tile_regs_release(); cb_pop_front(get_compile_time_arg_val(0), 1); cb_push_back(get_compile_time_arg_val(1), 1); } ```