Compiler Options
Code Generation Options
These flags control how TT-Lang compiles operations. Pass them on the command line,
or print the list with --ttl-help:
python my_kernel.py --ttl-help
python my_kernel.py --no-ttl-maximize-dst
| Flag | Default | Description |
|---|---|---|
| | enabled | Partition compute iteration spaces into subblocks that maximize DST register utilization, and reorder tile operations within sync regions to group by kind. Disabling falls back to per-tile synchronization. |
| | enabled | Emit FPU binary elementwise ops ( |
| | enabled | Emit |
| | disabled | Let the compiler insert and move DFB synchronization ops. When enabled, reserve/push may be refined to per-subblock granularity. When disabled, user-placed reserve/push is preserved as written. |
| | enabled | Combine consecutive |
Other Ways to Set These
Besides the command line, the same flags can be set through three other mechanisms. When the same flag is set in multiple places, higher-priority sources win and unmentioned flags fall through from lower levels:
| Priority | Mechanism | Example |
|---|---|---|
| 1 (lowest) | | (none) |
| 2 | | |
| 3 | | |
| 4 (highest) | Command-line arguments ( | |
The options keyword can also be passed at call time to override the decorator
for a single invocation:
my_kernel(tensor_a, tensor_b, options="--no-ttl-fpu-binary-ops")
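The priority rule above can be sketched as a layered merge. This is only an illustration of "higher priority wins, unmentioned flags fall through", not TT-Lang's actual implementation: `parse_no_flags` and `resolve` are hypothetical helpers, and the parser handles only the `--no-ttl-...` form that appears in this document.

```python
def parse_no_flags(arg_string):
    """Turn '--no-ttl-<name>' tokens into {name: False}; other tokens are ignored."""
    prefix = "--no-ttl-"
    flags = {}
    for token in arg_string.split():
        if token.startswith(prefix):
            flags[token[len(prefix):]] = False
    return flags


def resolve(*layers):
    """Merge option layers given from lowest to highest priority."""
    merged = {}
    for layer in layers:
        # A later (higher-priority) layer overrides only the flags it mentions.
        merged.update(layer)
    return merged


# Illustrative layers, lowest priority first.
defaults = {"maximize-dst": True, "fpu-binary-ops": True}
decorator_options = parse_no_flags("--no-ttl-maximize-dst")
command_line = parse_no_flags("--no-ttl-fpu-binary-ops")

effective = resolve(defaults, decorator_options, command_line)
print(effective)  # {'maximize-dst': False, 'fpu-binary-ops': False}
```

The command line does not mention `maximize-dst`, so the value set at the lower level survives; `fpu-binary-ops` is overridden by the highest-priority source.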
Compute Configuration
These two parameters are set on the @ttl.operation decorator (not via command-line
flags) and control the TTNN compute kernel hardware configuration:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | | Enable f32 accumulation in the DST register file. When |
| | | | Enable full DST synchronization (single-buffering mode). Doubles DST capacity (f32: 8, f16/bf16: 16) at the cost of a full sync between math and pack threads. |
@ttl.operation(grid=(2, 2), fp32_dest_acc_en=True, dst_full_sync_en=False)
def my_kernel(a, b): ...
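The capacity figures quoted for full sync (f32: 8, f16/bf16: 16, described as double the default) imply the arithmetic sketched below. `dst_capacity` is a hypothetical helper, and it assumes that `fp32_dest_acc_en=True` is what selects the f32 figure; the compiler's own computation may take other inputs into account.

```python
def dst_capacity(fp32_dest_acc_en: bool, dst_full_sync_en: bool) -> int:
    """Estimate DST capacity from the two compute-configuration parameters."""
    # Full-sync capacities as stated in the table: f32 -> 8, f16/bf16 -> 16.
    full_sync_capacity = 8 if fp32_dest_acc_en else 16
    # Full sync "doubles" capacity, so the default mode has half of it.
    return full_sync_capacity if dst_full_sync_en else full_sync_capacity // 2


for fp32 in (False, True):
    for full_sync in (False, True):
        print(f"fp32_dest_acc_en={fp32!s:5} dst_full_sync_en={full_sync!s:5} "
              f"-> {dst_capacity(fp32, full_sync)}")
```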
Environment Variables
These environment variables control compilation behavior and diagnostic output. They are independent of the code generation flags above.
| Variable | Type | Default | Description |
|---|---|---|---|
| | | | Compile kernels but do not execute on hardware. |
| | file path | (unset) | Write the pre-optimization MLIR module to this file. |
| | file path | (unset) | Write the post-optimization MLIR module to this file. |
| | any value | (unset) | Print the IR after every pass in the pipeline. Output is very large; redirect to a file. |
| | | | Include source locations in printed MLIR (locations are always tracked internally for error messages). |
| | | | Include raw MLIR diagnostics in error output. |
Profiling-related environment variables (TTLANG_AUTO_PROFILE,
TTLANG_PERF_DUMP, TTLANG_PERF_SERV, TTLANG_SIGNPOST_PROFILE,
TTLANG_PROFILE_CSV) are documented in the
Performance Tools reference.
Other Decorator Parameters
The @ttl.operation decorator also accepts these parameters for operation structure
and layout:
| Parameter | Type | Default | Description |
|---|---|---|---|
| | | (required) | Compute grid dimensions, e.g., |
| | | | Lambda functions for tile indexing |
| | | | |
| | | | Number of output tensor arguments |
| | | | Memory space for circular buffers: |
| | | | Use tiled tensor layout |
ttlang-opt Pass Reference
ttlang-opt is the standalone MLIR optimizer driver for the TTL dialect, used
primarily for compiler development and testing. It accepts all standard
mlir-opt flags (run ttlang-opt --help for the full list) plus the
TTL-specific passes and pipeline documented below.
Pipeline: ttl-to-ttkernel-pipeline
The main compilation pipeline, equivalent to what the Python API runs internally.
ttlang-opt input.mlir -p 'ttl-to-ttkernel-pipeline{maximize-dst=true lower-to-emitc=true}'
| Option | Type | Default | Description |
|---|---|---|---|
| | bool | | Enable DST maximization via subblock compute and scheduling. |
| | bool | | Use FPU for binary add/sub/mul. |
| | bool | | Lower matmul to block-level hardware calls ( |
| | bool | | Let the compiler insert and move DFB synchronization ops. |
| | bool | | Combine consecutive |
| | bool | | Run the TTKernel-to-EmitC backend (produces C++ source). |
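When driving `ttlang-opt` from a test script, the pipeline string can be assembled programmatically. The sketch below builds the same invocation as the command shown earlier; `pipeline_spec` is a hypothetical helper (not part of TT-Lang), and mapping Python keyword underscores to hyphens is a convention chosen here for convenience.

```python
import shlex


def pipeline_spec(name, **options):
    """Format an MLIR textual pipeline such as 'name{opt-a=true opt-b=false}'."""
    parts = []
    for key, value in options.items():
        option = key.replace("_", "-")  # maximize_dst -> maximize-dst
        if isinstance(value, bool):
            value = "true" if value else "false"
        parts.append(f"{option}={value}")
    # A pipeline with no options is written as the bare name.
    return f"{name}{{{' '.join(parts)}}}" if parts else name


spec = pipeline_spec("ttl-to-ttkernel-pipeline", maximize_dst=True, lower_to_emitc=True)
command = ["ttlang-opt", "input.mlir", "-p", spec]

# shlex.join quotes the pipeline argument so the line can be pasted into a shell.
print(shlex.join(command))
# ttlang-opt input.mlir -p 'ttl-to-ttkernel-pipeline{maximize-dst=true lower-to-emitc=true}'
```

Passing `command` as a list to `subprocess.run` avoids shell quoting altogether; the quoted form is only needed when the command is printed or pasted.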
The pipeline runs these passes in order:
1. convert-ttl-to-compute: lower TTL elementwise tensor ops to ttl.compute with tile ops
2. ttl-set-compute-kernel-config: set fp32_dest_acc_en / dst_full_sync_en defaults
3. ttl-assign-dst: DST register allocation (linear scan with copy insertion)
4. ttl-subblock-compute-for-dst: tile ttl.compute into DST-sized subblocks (only if maximize-dst=true); optionally refine reserve/push to per-subblock granularity (only if auto-sync=true)
5. ttl-insert-tile-regs-sync: insert math/pack thread synchronization
6. ttl-lower-matmul-block: mark block-matmul computes and expand stores (only if use-block-matmul=true)
7. ttl-lower-to-loops: lower ttl.compute to scf.for loops
8. ttl-schedule-operations: reorder tile ops by dependency depth and kind (only if maximize-dst=true)
9. ttl-annotate-cb-associations: annotate block args with CB indices
10. convert-ttl-to-ttkernel: lower TTL DMA ops to TTKernel
11. ttkernel-insert-inits: insert hardware init ops before compute ops
12. ttkernel-combine-pack-tiles: combine consecutive pack_tile into pack_tile_block (only if combine-pack-tiles=true)
13. Canonicalization and CSE cleanup
14. (if lower-to-emitc=true) lower-affine, convert-ttkernel-to-emitc, emitc-form-expressions
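The conditions in the pass list above can be summarized as a small function that returns which passes run for a given set of pipeline options. This is a reading aid, not the pipeline's source: `pipeline_passes` is hypothetical, the cleanup step is written with the standard MLIR pass names `canonicalize` and `cse` as an assumption, and `auto-sync` is omitted because the list describes it as changing behavior inside the subblock step rather than adding a pass.

```python
def pipeline_passes(maximize_dst, use_block_matmul, combine_pack_tiles, lower_to_emitc):
    """Return the pass names that run, in order, for the given option values."""
    passes = [
        "convert-ttl-to-compute",
        "ttl-set-compute-kernel-config",
        "ttl-assign-dst",
    ]
    if maximize_dst:
        passes.append("ttl-subblock-compute-for-dst")
    passes.append("ttl-insert-tile-regs-sync")
    if use_block_matmul:
        passes.append("ttl-lower-matmul-block")
    passes.append("ttl-lower-to-loops")
    if maximize_dst:
        passes.append("ttl-schedule-operations")
    passes += [
        "ttl-annotate-cb-associations",
        "convert-ttl-to-ttkernel",
        "ttkernel-insert-inits",
    ]
    if combine_pack_tiles:
        passes.append("ttkernel-combine-pack-tiles")
    passes += ["canonicalize", "cse"]  # assumed names for the cleanup step
    if lower_to_emitc:
        passes += ["lower-affine", "convert-ttkernel-to-emitc", "emitc-form-expressions"]
    return passes


minimal = pipeline_passes(False, False, False, False)
full = pipeline_passes(True, True, True, True)
print(len(minimal), len(full))
```

Comparing the two lists shows that `maximize-dst` gates two separate passes (subblocking and scheduling), which is why disabling it has a larger effect than the other options.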
Individual Pass Options
Each pass can also be run standalone for testing. Only passes with configurable options are listed; the remaining passes have no options.
ttl-set-compute-kernel-config
Set default compute kernel configuration attributes on ttl.compute ops.
| Option | Type | Default | Description |
|---|---|---|---|
| | bool | | Default |
| | bool | | Default |
ttlang-opt input.mlir -p 'func.func(ttl-set-compute-kernel-config{fp32-dest-acc-en=1})'
ttl-assign-dst
DST register allocator using linear scan allocation with in-place operation merging.
| Option | Type | Default | Description |
|---|---|---|---|
| | uint32_t | | Override DST register capacity. Auto-computed from |
| | bool | | Allocate outputs in a separate DST region (needed for reductions and some loop optimizations). |
| | bool | | Use FPU for binary add/sub/mul when both operands come from CBs. When disabled, binary ops use the SFPU path. |
ttlang-opt input.mlir -p 'func.func(ttl-assign-dst{dst-capacity=16 enable-fpu-binary-ops=0})'
ttl-subblock-compute-for-dst
Partition ttl.compute into DST-sized subblocks.
| Option | Type | Default | Description |
|---|---|---|---|
| | bool | | Refine DFB reserve/push to per-subblock granularity, enabling |
ttlang-opt input.mlir -p 'func.func(ttl-subblock-compute-for-dst{subblock-sync=true})'
ttl-dump-cb-flow-graph
Analyze circular buffer producer/consumer relationships and dump the flow graph.
| Option | Type | Default | Description |
|---|---|---|---|
| | string | | Path to write JSON output. Empty string prints to stderr only. |
ttlang-opt input.mlir -p 'ttl-dump-cb-flow-graph{output="/tmp/cb_graph.json"}'