# Performance Tools

TT-Lang provides built-in performance analysis tools for profiling operations on Tenstorrent hardware. All tools are activated via environment variables and print results after operation execution.

## Using Claude for Performance Analysis

The `/ttl-profile` and `/ttl-optimize` [Claude Code](https://claude.com/claude-code) skills are designed to work with these tools. Claude knows about all of the performance tools documented here and can walk you through adding signposts, running the profiler, interpreting perf dumps, and optimizing your kernels. See the [Claude Skills](../claude-skills.md) page for setup instructions.

## Environment Variables

| Variable | Description |
|---|---|
| `TT_METAL_HOME` | Path to tt-metal (required for all profiling) |
| `TT_METAL_DEVICE_PROFILER=1` | Enable device profiler (required for all profiling) |
| `TT_METAL_PROFILER_MID_RUN_DUMP=1` | Enable mid-run profiler dumps (required for all profiling) |
| `TT_METAL_DEVICE_PROFILER_NOC_EVENTS=1` | Enable NOC event tracing (required for perf summary) |
| `TTLANG_PERF_DUMP=1` | Print NOC traffic and per-kernel wall time summary |
| `TTLANG_AUTO_PROFILE=1` | Instrument every line with signposts and print per-line cycle counts |
| `TTLANG_SIGNPOST_PROFILE=1` | Print cycle counts for user-defined `ttl.signpost()` regions |
| `TTLANG_PERF_SERV=1` | Serve profiler data as a Perfetto trace after execution |

**Valid combinations:**

- `TTLANG_PERF_DUMP` can be used standalone or alongside either profiler.
- `TTLANG_AUTO_PROFILE` and `TTLANG_SIGNPOST_PROFILE` must be used independently (not together). Auto-profiling instruments every line automatically; signpost profiling only measures user-annotated regions.
- `TTLANG_PERF_SERV` can be combined with either `TTLANG_AUTO_PROFILE` or `TTLANG_SIGNPOST_PROFILE` to visualize results in Perfetto.

## Perf Summary

Set `TTLANG_PERF_DUMP=1` to print a NOC traffic and per-kernel wall time summary after operation execution.

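The requirements in the Environment Variables table and the valid-combinations list can be checked before launching a run. The following is a minimal sketch of such a preflight check, not part of TT-Lang: `check_profiling_env` is a hypothetical helper, and it assumes a variable counts as enabled only when it is set to exactly `1`, which is how the table writes them. How TT-Lang itself treats other values is not covered here.

```python
import os


def check_profiling_env(env):
    """Return a list of problems found in the profiling variables of `env`.

    Hypothetical helper: it only encodes the rules stated on this page.
    """

    def enabled(name):
        # Assumption: "enabled" means the variable is set to exactly "1".
        return env.get(name) == "1"

    problems = []

    # Required for all profiling, per the Environment Variables table.
    if not env.get("TT_METAL_HOME"):
        problems.append("TT_METAL_HOME is not set")
    for name in ("TT_METAL_DEVICE_PROFILER", "TT_METAL_PROFILER_MID_RUN_DUMP"):
        if not enabled(name):
            problems.append(f"{name}=1 is required for all profiling")

    # The perf summary additionally needs NOC event tracing.
    if enabled("TTLANG_PERF_DUMP") and not enabled(
        "TT_METAL_DEVICE_PROFILER_NOC_EVENTS"
    ):
        problems.append(
            "TTLANG_PERF_DUMP=1 needs TT_METAL_DEVICE_PROFILER_NOC_EVENTS=1"
        )

    # The two profilers must be used independently.
    if enabled("TTLANG_AUTO_PROFILE") and enabled("TTLANG_SIGNPOST_PROFILE"):
        problems.append(
            "TTLANG_AUTO_PROFILE and TTLANG_SIGNPOST_PROFILE cannot be combined"
        )

    return problems


if __name__ == "__main__":
    for problem in check_profiling_env(os.environ):
        print("profiling env:", problem)
```

Running the script in the same shell that will launch the program prints one line per problem and nothing when the configuration matches the rules above.
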
**Required environment variables** (must be exported before running):

```bash
export TT_METAL_HOME=/path/to/tt-metal
export TT_METAL_DEVICE_PROFILER=1
export TT_METAL_DEVICE_PROFILER_NOC_EVENTS=1
export TT_METAL_PROFILER_MID_RUN_DUMP=1
export TTLANG_PERF_DUMP=1
python path/to/program.py  # just run with python
```

**Sample output:**

```
--- Program 1024 (__demo_kernel) ---
  grid:           1x1 (1 nodes)
  duration:       2,225,436 cycles (1.65 ms)
  DRAM read:      5.4 MB (2790 transfers)
  DRAM write:     5.0 MB (2582 transfers)
  effective BW:   6.7 GB/s (total payload / duration)
  transfer size:  2.0 KB (uniform)
  barriers:       57 read (1 per 49 reads), 161 write (1 per 16 writes)
  noc reads:      NOC_0=2790
  noc writes:     NOC_1=2582
  DRAM channels:  16
  kernel time:
    BRISC    2,225,356 cycles (1.65 ms)
    NCRISC   2,211,871 cycles (1.64 ms)
    TRISC_0  2,222,025 cycles (1.65 ms)
    TRISC_1  2,222,876 cycles (1.65 ms)
    TRISC_2  2,222,358 cycles (1.65 ms)
```

### Standalone usage

The perf summary tool can also be run standalone against previously collected profiler logs. This works with any tt-metal program, not just TT-Lang operations -- it parses the same NOC trace JSON and device profiler CSV that tt-metal's profiling infrastructure produces.

```bash
# Default: reads from $TT_METAL_HOME/generated/profiler/.logs/
python -m ttl._src.perf_summary

# Custom path
python -m ttl._src.perf_summary --path /path/to/profiler/.logs/

# Machine-readable JSON output
python -m ttl._src.perf_summary --path /path/to/profiler/.logs/ --json

# Filter to specific operation names
python -m ttl._src.perf_summary --names "my_kernel,ttnn.multiply"
```

## Auto-Profiling

TT-Lang includes built-in auto-profiling that instruments operations with signposts and generates per-line cycle count reports.

**Required environment variables** (must be exported before running):

```bash
export TT_METAL_HOME=/path/to/tt-metal
export TT_METAL_DEVICE_PROFILER=1
export TT_METAL_PROFILER_MID_RUN_DUMP=1
export TTLANG_AUTO_PROFILE=1
```

**Example:**

```bash
export TT_METAL_HOME=/workspace/tt-mlir/third_party/tt-metal/src/tt-metal
export TT_METAL_DEVICE_PROFILER=1
export TT_METAL_PROFILER_MID_RUN_DUMP=1
export TTLANG_AUTO_PROFILE=1
python examples/elementwise-tutorial/step_4_multinode_grid_auto.py
```

**Sample output:**

```
====================================================================================================
THREAD: NCRISC [demo_read] (8000 ops, 160,260 cycles, 100.0% of total)
====================================================================================================

LINE    %TIME   CYCLES      SOURCE
------  ------- ----------  ----------------------------------------------------------------------
   105                      def demo_read():
   106     2.5%  18-50          core_x, core_y = ttl.node(dims=2)  (x192, avg=21.3, total=4,086)
   108                          for core_row in range(rows_per_core):
   109     3.7%  18-49              row = core_x * rows_per_core + core_row  (x256, avg=23.4, total=5,996)
   110     3.4%  18-41              start_row_tile = row * row_tiles_per_block  (x256, avg=21.6, total=5,517)
   111     5.1%  18-60              end_row_tile = (row + 1) * row_tiles_per_block  (x384, avg=21.1, total=8,109)
   113                              for core_col in range(cols_per_core):
   114     6.5%  18-49                  col = core_y * cols_per_core + core_col  (x512, avg=20.2, total=10,349)
   115     6.7%  18-43                  start_col_tile = col * col_tiles_per_block  (x512, avg=20.8, total=10,669)
   116     6.8%  18-55                  end_col_tile = (col + 1) * col_tiles_per_block  (x512, avg=21.2, total=10,863)
   118                                  with (
   119    11.3%  18,068                     a_cb.reserve() as a_blk,
                                            ├─ 10,162  cb_reserve (x512)
                                            ╰─ 7,906  cb_push (implicit) (x384)
```

See [auto-profiler-examples/](auto-profiler-examples/) for more complete sample outputs.

> **Warning:** Each node supports only 125 signposts. Kernels with many operations in tight loops may overflow this buffer, causing later signposts to be silently dropped and cycle counts to be mismatched. See [#268](https://github.com/tenstorrent/tt-lang/issues/268) for details.

## User-Defined Signposts

Use `ttl.signpost("name")` as a context manager to measure cycle counts for targeted code blocks instead of every line. This is useful when you only care about specific regions, or when auto-profiling overflows the signpost buffer.

Signposts and auto-profiling must be used independently. If both are enabled, user signposts are skipped with a warning.

**Important:** Each node supports only 125 signposts. To avoid overflowing the signpost buffer, update your operation to run only one iteration when profiling. Watch for warnings about buffer overflow in the output.

**Required environment variables:**

```bash
export TT_METAL_HOME=/path/to/tt-metal
export TT_METAL_DEVICE_PROFILER=1
export TT_METAL_PROFILER_MID_RUN_DUMP=1
export TTLANG_SIGNPOST_PROFILE=1
```

### Example

**Input program:**

```python
@ttl.compute()
def demo_compute():
    with c_dfb.wait() as c_blk:
        for _ in range(rows):
            for _ in range(cols):
                with (
                    a_dfb.wait() as a_blk,
                    b_dfb.wait() as b_blk,
                    y_dfb.reserve() as y_blk,
                ):
                    with ttl.signpost("compute"):
                        with ttl.signpost("broadcast"):
                            a_bcast = ttl.math.broadcast(a_blk, y_blk, dims=[1])
                            b_bcast = ttl.math.broadcast(b_blk, y_blk, dims=[0])
                            c_bcast = ttl.math.broadcast(c_blk, y_blk, dims=[0, 1])
                        with ttl.signpost("math"):
                            tmp = a_bcast * b_bcast + c_bcast
                        with ttl.signpost("store"):
                            y_blk.store(tmp)
```

**Generated C++ (compute kernel):**

```cpp
for (size_t k10 = v6; k10 < v4; k10 += v5) {
  for (size_t l11 = v6; l11 < v4; l11 += v5) {
    tile_regs_acquire();
    {
      DeviceZoneScopedN("ttl_compute");
      {
        DeviceZoneScopedN("ttl_broadcast");
        unary_bcast_init(get_compile_time_arg_val(0), get_compile_time_arg_val(3));
        unary_bcast(get_compile_time_arg_val(0), k10, v6);
        unary_bcast_init(get_compile_time_arg_val(1), get_compile_time_arg_val(3));
        unary_bcast(get_compile_time_arg_val(1), l11, v5);
        mul_binary_tile_init();
        mul_binary_tile(v6, v5, v6);
        unary_bcast_init(get_compile_time_arg_val(2), get_compile_time_arg_val(3));
        unary_bcast(get_compile_time_arg_val(2), v6, v5);
        {
          DeviceZoneScopedN("ttl_math");
          add_binary_tile_init();
          add_binary_tile(v6, v5, v6);
          {
            DeviceZoneScopedN("ttl_store");
            tile_regs_commit();
            tile_regs_wait();
            size_t v12 = k10 * v4;
            size_t v13 = v12 + l11;
            pack_tile(v6, get_compile_time_arg_val(3), v13);
          }
        }
      }
    }
    tile_regs_release();
  }
}
```

**Results:**

```
================================================================================
SIGNPOST PROFILE
================================================================================

NAME        THREAD        COUNT        TOTAL        AVG        MIN        MAX
----------- ------------ ------ ------------ ---------- ---------- ----------
store       TRISC_0          31          616         19         18         28
math        TRISC_0          31        2,232         72         62         80
broadcast   TRISC_0          31       21,005        677        517        806
compute     TRISC_0          32       23,265        727        568        856
store       TRISC_1          31        1,036         33         32         43
math        TRISC_1          31        6,279        202        198        215
broadcast   TRISC_1          31       24,436        788        719      2,535
compute     TRISC_1          32       26,458        826        676      2,575
store       TRISC_2          31        1,952         62         61         67
math        TRISC_2          31        3,641        117        106        125
broadcast   TRISC_2          31       22,881        738        585      2,171
compute     TRISC_2          32       25,250        789        641      2,222
```

Signposts can be nested as shown above. The report breaks down cycle counts per signpost region per hardware thread, showing count, total, average, min, and max cycles.

## Perfetto Trace Server

Set `TTLANG_PERF_SERV=1` to serve profiler data as a [Perfetto](https://perfetto.dev/) trace after operation execution. This works with both auto-profiling and signpost profiling. The server converts the device profiler CSV to Chrome Trace Event format and opens it in the Perfetto UI.

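To give a feel for what a conversion to Chrome Trace Event format involves, here is a minimal sketch. It is not the server's implementation: the input rows are a simplified, hypothetical stand-in for the profiler CSV (whose columns are not described on this page), the `"begin"`/`"end"` phase labels and the `zones_to_trace_events` helper are invented for illustration, and the clock frequency is an assumed parameter. Only the output shape follows the Chrome Trace Event format: `B`/`E` duration events with `ts` in microseconds, plus `M` metadata events that name each thread.

```python
import json


def zones_to_trace_events(rows, freq_mhz):
    """Convert (thread, zone, phase, cycles) rows to a Chrome trace dict.

    `rows` is a hypothetical, simplified view of profiler zone markers;
    `phase` is "begin" or "end"; `freq_mhz` is the assumed device clock.
    """
    thread_ids = {}
    events = []
    start = min(cycles for _, _, _, cycles in rows)

    for thread, zone, phase, cycles in rows:
        if thread not in thread_ids:
            # First time this thread appears: assign an integer tid and
            # emit a metadata event so the UI shows a readable name.
            thread_ids[thread] = len(thread_ids)
            events.append({
                "name": "thread_name",
                "ph": "M",
                "pid": 0,
                "tid": thread_ids[thread],
                "args": {"name": thread},
            })
        events.append({
            "name": zone,
            "ph": "B" if phase == "begin" else "E",
            # cycles / MHz = microseconds, relative to the first marker
            "ts": (cycles - start) / freq_mhz,
            "pid": 0,
            "tid": thread_ids[thread],
        })

    return {"traceEvents": events}


# Made-up markers for one thread with a nested region.
rows = [
    ("TRISC_0", "compute", "begin", 1000),
    ("TRISC_0", "math", "begin", 1200),
    ("TRISC_0", "math", "end", 1272),
    ("TRISC_0", "compute", "end", 1727),
]
trace = zones_to_trace_events(rows, freq_mhz=1000.0)
print(json.dumps(trace, indent=2))
```

Nested signposts map naturally onto nested `B`/`E` pairs on the same thread, which is why the trace view can show them as stacked slices.
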
![Perfetto trace server showing signpost regions per hardware thread](perfetto-trace-server.png)

**Required environment variables:**

```bash
# Setup:
export TT_METAL_DEVICE_PROFILER=1
export TT_METAL_PROFILER_MID_RUN_DUMP=1

# With signpost profiling:
export TTLANG_SIGNPOST_PROFILE=1
export TTLANG_PERF_SERV=1
python path/to/program.py

# Or with auto-profiling:
export TTLANG_AUTO_PROFILE=1
export TTLANG_PERF_SERV=1
python path/to/program.py
```

After operation execution, the server starts and prints connection instructions:

```
======================================================================
TTLANG PERFETTO TRACE SERVER
======================================================================
  419 trace events ready
  Serving on port 48019

  From your local machine, run:
    ssh -N -L 48019:172.17.0.2:48019 user@

  Then open: http://localhost:48019

  Press Enter to stop the server...
======================================================================
```

If the program runs on a remote server or inside a Docker container, use the SSH tunnel command shown in the output to forward the port to your local machine. If it runs on your local machine, open the URL directly.

### Standalone usage

The server can also be run standalone against previously collected profiler logs. Like the perf summary tool, this works with any tt-metal program -- it parses the standard `profile_log_device.csv` that tt-metal's profiling infrastructure produces.

```bash
python -m ttl._src.perf_trace_server --path /path/to/profiler/.logs/
```