Compiler Options

Code Generation Options

These flags control how TT-Lang compiles operations. Pass them on the command line, or print the list with --ttl-help:

python my_kernel.py --ttl-help
python my_kernel.py --no-ttl-maximize-dst

Flag	Default	Description
`--ttl-maximize-dst` / `--no-ttl-maximize-dst`	enabled	Partition compute iteration spaces into subblocks that maximize DST register utilization, and reorder tile operations within sync regions to group by kind. Disabling falls back to per-tile synchronization.
`--ttl-fpu-binary-ops` / `--no-ttl-fpu-binary-ops`	enabled	Emit FPU binary elementwise ops (`add_tiles`, `sub_tiles`, `mul_tiles`) when both operands come from dataflow buffers. When disabled, binary ops use the SFPU path.
`--ttl-block-matmul` / `--no-ttl-block-matmul`	enabled	Emit `matmul_block` (processes the full tile block atomically) instead of per-tile matmul loops. Disabling this option is not yet supported.
`--ttl-subblock-sync` / `--no-ttl-subblock-sync`	disabled	Refine DFB reserve/push to per-subblock granularity, enabling `pack_tile_block` for contiguous subblocks. When disabled, user-placed reserve/push is preserved as written.
`--ttl-combine-pack-tiles` / `--no-ttl-combine-pack-tiles`	enabled	Combine consecutive `pack_tile` ops on the same DFB with contiguous DST and DFB indices into a single `pack_tile_block` call.
`--ttl-strict-f32-acc` / `--no-ttl-strict-f32-acc`	disabled	Error at compile time if a `+=` accumulation loop’s output block exceeds f32 DST capacity (4 tiles with double-buffering). When enabled, guarantees each accumulation step fits in a single DST section without subblocking.
`--ttl-compiler-dfbs` / `--no-ttl-compiler-dfbs`	enabled	Insert compiler-allocated intermediate DFBs at fusion split points where an operation requires DFB-attached inputs (reduce, broadcast, matmul, transpose). When disabled, the compiler emits an error if any fused computation requires an intermediate DFB.

Other Ways to Set These

Besides the command line, the same flags can be set through three other mechanisms. When the same flag is set in multiple places, higher-priority sources win and unmentioned flags fall through from lower levels:

Priority	Mechanism	Example
1 (lowest)	`CompilerOptions` class defaults	—
2	`@ttl.operation` decorator `options=` parameter	`@ttl.operation(grid=(2,2), options="--no-ttl-maximize-dst")`
3	`TTLANG_COMPILER_OPTIONS` environment variable	`export TTLANG_COMPILER_OPTIONS="--no-ttl-fpu-binary-ops"`
4 (highest)	Command-line arguments (`sys.argv`)	`python my_kernel.py --no-ttl-maximize-dst`

The options keyword can also be passed at call time to override the decorator for a single invocation:

my_kernel(tensor_a, tensor_b, options="--no-ttl-fpu-binary-ops")

Compute Configuration

These two parameters are set on the @ttl.operation decorator (not via command-line flags) and control the TTNN compute kernel hardware configuration:

Parameter	Type	Default	Description
`fp32_dest_acc_en`	`bool` or `None`	`None`	Enable f32 accumulation in the DST register file. When `None`, auto-detected from input tensor dtypes (enabled when any input is f32).
`dst_full_sync_en`	`bool` or `None`	`None`	Enable full DST synchronization (single-buffering mode). Doubles DST capacity (f32: 8, f16/bf16: 16) at the cost of a full sync between math and pack threads.

@ttl.operation(grid=(2, 2), fp32_dest_acc_en=True, dst_full_sync_en=False)
def my_kernel(a, b): ...

Environment Variables

These environment variables control compilation behavior and diagnostic output. They are independent of the code generation flags above.

Variable	Type	Default	Description
`TTLANG_COMPILE_ONLY`	`0`/`1`	`0`	Compile kernels but do not execute on hardware.
`TTLANG_INITIAL_MLIR`	file path	(unset)	Write the pre-optimization MLIR module to this file.
`TTLANG_FINAL_MLIR`	file path	(unset)	Write the post-optimization MLIR module to this file.
`TTLANG_VERBOSE_PASSES`	any value	(unset)	Print the IR after every pass in the pipeline. Output is very large; redirect to a file.
`TTLANG_DEBUG_LOCATIONS`	`0`/`1`	`0`	Include source locations in printed MLIR (locations are always tracked internally for error messages).
`TTLANG_VERBOSE_ERRORS`	`0`/`1`	`0`	Include raw MLIR diagnostics in error output.
`TTLANG_SIM_ONLY`	`0`/`1`	`0`	Force `import ttl` to skip loading the compiled MLIR extension. Used when running the simulator from a source tree without an installed `tt-lang-sim` wheel (which ships the same signal as a marker module).

Profiling-related environment variables (TTLANG_AUTO_PROFILE, TTLANG_PERF_DUMP, TTLANG_PERF_SERV, TTLANG_SIGNPOST_PROFILE, TTLANG_PROFILE_CSV) are documented in the Performance Tools reference.

Other Decorator Parameters

The @ttl.operation decorator also accepts these parameters for operation structure and layout:

Parameter	Type	Default	Description
`grid`	`tuple` or `Callable`	(required)	Compute grid dimensions, e.g., `(2, 2)`
`indexing_maps`	`list[Callable]`	`None`	Lambda functions for tile indexing
`iterator_types`	`list[str]`	`None`	`"parallel"` or `"reduction"` per dimension
`num_outs`	`int`	`1`	Number of output tensor arguments
`memory_space`	`str`	`"L1"`	Memory space for dataflow buffers: `"L1"` or `"DRAM"`
`tiled`	`bool`	`True`	Use tiled tensor layout

`ttlang-opt` Pass Reference

ttlang-opt is the standalone MLIR optimizer driver for the TTL dialect, used primarily for compiler development and testing. It accepts all standard mlir-opt flags (run ttlang-opt --help for the full list) plus the TTL-specific passes and pipeline documented below.

Pipeline: `ttl-to-ttkernel-pipeline`

The main compilation pipeline, equivalent to what the Python API runs internally.

ttlang-opt input.mlir -p 'ttl-to-ttkernel-pipeline{maximize-dst=true lower-to-emitc=true}'

Option	Type	Default	Description
`maximize-dst`	bool	`true`	Enable DST maximization via subblock compute and scheduling.
`enable-fpu-binary-ops`	bool	`true`	Use FPU for binary add/sub/mul.
`use-block-matmul`	bool	`true`	Lower matmul to block-level hardware calls (`matmul_block`).
`subblock-sync`	bool	`false`	Refine DFB reserve/push to per-subblock granularity.
`combine-pack-tiles`	bool	`true`	Combine consecutive `pack_tile` ops into `pack_tile_block`.
`strict-f32-acc`	bool	`false`	Error if a `+=` accumulation loop’s output block exceeds f32 DST capacity.
`compiler-dfbs`	bool	`true`	Insert compiler-allocated intermediate DFBs for fused computations. Error if disabled and any operation requires one.
`lower-to-emitc`	bool	`false`	Run the TTKernel-to-EmitC backend (produces C++ source).

The pipeline runs these passes in order:

ttl-insert-intermediate-dfbs — allocate compiler-managed DFBs for intermediate values (transposes, etc.); verify and error when compiler-dfbs=false
ttl-insert-copy-wait — insert missing ttl.wait after ttl.copy ops whose transfer handle has no wait user
ttl-insert-cb-sync — insert DFB wait/pop/reserve/push around compute regions
ttl-annotate-l1-acc-loops — detect += accumulation loops and annotate for L1 packer accumulation
convert-ttl-to-compute — lower TTL elementwise tensor ops to ttl.compute with tile ops
ttl-set-compute-kernel-config — set fp32_dest_acc_en / dst_full_sync_en defaults
ttl-assign-dst — DST register allocation (linear scan with copy insertion)
ttl-subblock-compute-for-dst — tile ttl.compute into DST-sized subblocks (only if maximize-dst=true); optionally refine reserve/push to per-subblock granularity (only if subblock-sync=true)
ttl-insert-tile-regs-sync — insert math/pack thread synchronization
ttl-lower-to-loops — lower ttl.compute to scf.for loops; matmul computes are expanded inline via generateMatmulCompute
ttl-schedule-operations — reorder tile ops by dependency depth and kind (only if maximize-dst=true)
ttl-annotate-cb-associations — annotate block args with DFB indices
convert-ttl-to-ttkernel — lower TTL DMA ops to TTKernel
ttkernel-insert-inits — insert hardware init ops before compute ops
ttkernel-insert-l1-accumulation — insert pack_reconfig_l1_acc guards for += and reduction loops
ttkernel-combine-pack-tiles — combine consecutive pack_tile into pack_tile_block (only if combine-pack-tiles=true)
Canonicalization and CSE cleanup
(if lower-to-emitc=true) lower-affine, convert-ttkernel-to-emitc, emitc-form-expressions

Individual Pass Options

Each pass can also be run standalone for testing. Only passes with configurable options are listed; the remaining passes have no options.

`ttl-insert-intermediate-dfbs`

Insert compiler-allocated intermediate DFBs at fusion split points.

Option	Type	Default	Description
`enable`	bool	`true`	Insert compiler-allocated DFBs. When false, emit an error if any operation requires one.

ttlang-opt input.mlir -p 'func.func(ttl-insert-intermediate-dfbs{enable=false})'

`ttl-set-compute-kernel-config`

Set default compute kernel configuration attributes on ttl.compute ops.

Option	Type	Default	Description
`fp32-dest-acc-en`	bool	`false`	Default `fp32_dest_acc_en` when not already configured.
`dst-full-sync-en`	bool	`false`	Default `dst_full_sync_en` when not already configured.

ttlang-opt input.mlir -p 'func.func(ttl-set-compute-kernel-config{fp32-dest-acc-en=1})'

`ttl-assign-dst`

DST register allocator using linear scan allocation with in-place operation merging.

Option	Type	Default	Description
`dst-capacity`	uint32_t	`0` (auto)	Override DST register capacity. Auto-computed from `fp32_dest_acc_en` and `dst_full_sync_en` by default. Single-buffering (`dst_full_sync_en=true`): f32=8, f16/bf16=16. Double-buffering (default): f32=4, f16/bf16=8.
`separate-output-region`	bool	`false`	Allocate outputs in a separate DST region (needed for reductions and some loop optimizations).
`enable-fpu-binary-ops`	bool	`true`	Use FPU for binary add/sub/mul when both operands come from DFBs. When disabled, binary ops use the SFPU path.

ttlang-opt input.mlir -p 'func.func(ttl-assign-dst{dst-capacity=16 enable-fpu-binary-ops=0})'

`ttl-subblock-compute-for-dst`

Partition ttl.compute into DST-sized subblocks.

Option	Type	Default	Description
`subblock-sync`	bool	`false`	Refine DFB reserve/push to per-subblock granularity, enabling `pack_tile_block` for contiguous subblocks. When disabled, user-placed reserve/push is preserved.
`strict-f32-acc`	bool	`false`	Error if a `+=` accumulation loop with non-f32 output requires subblocking. Subblocking reduces accumulation precision because bf16 L1 intermediates truncate f32 DST values.

ttlang-opt input.mlir -p 'func.func(ttl-subblock-compute-for-dst{subblock-sync=true})'

`ttl-dump-cb-flow-graph`

Analyze dataflow buffer producer/consumer relationships and dump the flow graph.

Option	Type	Default	Description
`output`	string	`""`	Path to write JSON output. Empty string prints to stderr only.

ttlang-opt input.mlir -p 'ttl-dump-cb-flow-graph{output="/tmp/cb_graph.json"}'

Compiler Options

Code Generation Options

Other Ways to Set These

Compute Configuration

Environment Variables

Other Decorator Parameters

ttlang-opt Pass Reference

Pipeline: ttl-to-ttkernel-pipeline

Individual Pass Options

ttl-insert-intermediate-dfbs

ttl-set-compute-kernel-config

ttl-assign-dst

ttl-subblock-compute-for-dst

ttl-dump-cb-flow-graph