# Programming Guide

This page covers compiler options, print debugging, performance tools, the simulator, and examples for TT-Lang operation development.

## Compiler Options

Operations accept compiler options that control code generation (e.g., `--no-ttl-maximize-dst`, `--no-ttl-fpu-binary-ops`). These can be passed as command-line arguments, via the `@ttl.operation` decorator's `options=` parameter, or via the `TTLANG_COMPILER_OPTIONS` environment variable. Command-line arguments take highest priority.

```bash
# List available options
python examples/elementwise-tutorial/step_4_multinode_grid_auto.py --ttl-help

# Run an operation with options
python examples/elementwise-tutorial/step_4_multinode_grid_auto.py --no-ttl-maximize-dst
```

See the [full compiler options reference](reference/compiler-options.md) for all decorator parameters, `CompilerOptions` flags with their MLIR pass mappings, environment variables, and `ttlang-opt` pass options.

## Print Debugging

Use `print()` inside kernel code to emit device debug prints. Enable them at runtime with `TT_METAL_DPRINT_CORES`:

```bash
export TT_METAL_DPRINT_CORES=0,0  # core to capture
# Redirect stdout first, then stderr, so both streams land in the file
python my_kernel.py > output.txt 2>&1
```

```python
@ttl.compute()
def compute():
    with inp_dfb.wait() as tile, out_dfb.reserve() as o:
        print("hello")  # auto: math thread
        print(tile)  # auto: pack thread
        result = ttl.exp(tile)
        print(_dump_dst_registers=True, label="after exp")  # auto: math thread
        o.store(result)

@ttl.datamovement()
def dm_write():
    print(out_dfb)  # CB metadata
    with out_dfb.wait() as blk:
        print(blk, num_pages=1)  # raw tensor page
        tx = ttl.copy(blk, out[0, 0])
        tx.wait()
```

- Print output can be extremely large and slow to produce; redirect it to a file and search it with grep.
- In compute kernels, guard prints with `thread="math"`, `thread="pack"`, or `thread="unpack"` to avoid overlapping output from the three TRISC threads.
- When using multi-tile block sizes (CB shape > 1x1), prints inside the generated loop will dump all tiles in the block.
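Once the output has been redirected to a file, a small filter can stand in for grep when you also want a few lines after each match. The following is a minimal sketch, not part of TT-Lang: `filter_debug_log` is a hypothetical helper, and it assumes that the text you search for (for example a `label=` string such as `after exp`) appears verbatim in the captured output.

```python
import re


def filter_debug_log(lines, pattern, context=0):
    """Return lines matching `pattern`, plus `context` lines after each match.

    Hypothetical helper; it makes no assumption about the print format beyond
    the pattern appearing somewhere in a line.
    """
    regex = re.compile(pattern)
    selected = []
    remaining = 0  # trailing-context lines still to emit after the last match
    for line in lines:
        if regex.search(line):
            selected.append(line.rstrip("\n"))
            remaining = context
        elif remaining > 0:
            selected.append(line.rstrip("\n"))
            remaining -= 1
    return selected
```

Because the function accepts any iterable of lines, an open file can be passed directly, e.g. `filter_debug_log(open("output.txt"), "after exp", context=4)`, so a large capture is streamed rather than loaded into memory at once.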

See the [full print debugging reference](reference/print-debugging.md) for all supported modes (scalars, tiles, tensor pages, CB details, DST registers, thread conditioning).

## Performance Tools

TT-Lang includes built-in performance analysis tools for profiling operations on hardware:

- Perf Summary (`TTLANG_PERF_DUMP=1`): NOC traffic and per-kernel wall time breakdown
- Auto-Profiling (`TTLANG_AUTO_PROFILE=1`): automatic per-line cycle count instrumentation
- User-Defined Signposts (`TTLANG_SIGNPOST_PROFILE=1`): targeted cycle counts for `ttl.signpost()` regions
- Perfetto Trace Server (`TTLANG_PERF_SERV=1`): visualize profiler data in the Perfetto UI

Performance tracing (Tracy) is enabled by default at build time. To disable it, configure with `-DTTLANG_ENABLE_PERF_TRACE=OFF`.

See the [full performance tools reference](reference/performance-tools.md) for environment variable details, valid combinations, and sample output.

## Simulator

See the [Functional Simulator](simulator.md) page for running kernels without hardware, debugging setup, and test commands.

## Examples

See the `examples/` and `test/` directories for complete working examples, including:

- `test/python/simple_add.py`
- `test/python/simple_fused.py`

The [tour](tour/index.md) provides an introduction to TT-Lang features.
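
To try one of the example scripts with a performance tool enabled, the environment variable can be set for a single run instead of being exported into the shell. The sketch below only prepares the environment and command line. `PERF_VARS`, `perf_env`, and `perf_command` are hypothetical names introduced here for illustration; the variable names themselves come from the Performance Tools list. It enables one tool at a time, since which combinations are valid is covered in the performance tools reference rather than on this page.

```python
import os
import sys

# Environment variable names taken from the Performance Tools section.
# The short keys are hypothetical labels used only by this sketch.
PERF_VARS = {
    "summary": "TTLANG_PERF_DUMP",
    "auto": "TTLANG_AUTO_PROFILE",
    "signpost": "TTLANG_SIGNPOST_PROFILE",
    "perfetto": "TTLANG_PERF_SERV",
}


def perf_env(tool, base=None):
    """Return a copy of the environment with exactly one perf tool enabled."""
    if tool not in PERF_VARS:
        raise ValueError(f"unknown tool {tool!r}; expected one of {sorted(PERF_VARS)}")
    env = dict(os.environ if base is None else base)
    # Clear the other tool variables so only the requested one is set.
    for name in PERF_VARS.values():
        env.pop(name, None)
    env[PERF_VARS[tool]] = "1"
    return env


def perf_command(script):
    """Build the argv for running a script with the current interpreter."""
    return [sys.executable, script]
```

A run would then look like `subprocess.run(perf_command("test/python/simple_add.py"), env=perf_env("summary"), check=True)`, which leaves the calling shell's environment untouched.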