# Elementwise Operation Tutorial

This tutorial walks through building a fused elementwise operation in TT-Lang,
introducing one concept at a time. Each step is a self-contained runnable
script.

## The Goal

We want to compute `y = (a * b + c) * d` on 2048×2048 `bfloat16` tensors. The
inner expression `a * b + c` is the target for kernel fusion: instead of
dispatching three separate TT-NN operations that each read and write DRAM, a
custom TT-Lang operation reads each input once, computes the result in L1, and
writes the output once. The expression, the tensor size, and the data type (for
example `float32`) can all be varied, and we encourage you to experiment with them.
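
A rough count of DRAM traffic shows what fusion can save. The sketch below is plain Python arithmetic, not TT-Lang code. It assumes that every unfused binary operation reads two full tensors and writes one, that a fused operation reads each input once and writes one output, and that a `bfloat16` element takes 2 bytes. The helper names are illustrative only.

```python
TILE_SIZE = 32       # tiles are 32x32 elements
BYTES_PER_BF16 = 2   # bfloat16 is a 16-bit format


def tensor_bytes(rows: int, cols: int, bytes_per_elem: int = BYTES_PER_BF16) -> int:
    """Size in bytes of a dense rows x cols tensor."""
    return rows * cols * bytes_per_elem


def unfused_transfers(num_binary_ops: int) -> int:
    """Assumption: each unfused binary op reads two tensors and writes one."""
    return 3 * num_binary_ops


def fused_transfers(num_inputs: int) -> int:
    """Assumption: a fused op reads each input once and writes one output."""
    return num_inputs + 1


size = tensor_bytes(2048, 2048)                     # bytes per tensor
tiles = (2048 // TILE_SIZE) * (2048 // TILE_SIZE)   # tiles per tensor

# Inner expression a * b + c: two binary ops, three inputs.
inner_unfused = unfused_transfers(2) * size
inner_fused = fused_transfers(3) * size

# Full expression (a * b + c) * d: three binary ops, four inputs.
full_unfused = unfused_transfers(3) * size
full_fused = fused_transfers(4) * size

MIB = 1024 * 1024
print(f"tensor: {size // MIB} MiB, {tiles} tiles")
print(f"a*b+c      unfused {inner_unfused // MIB} MiB, fused {inner_fused // MIB} MiB")
print(f"(a*b+c)*d  unfused {full_unfused // MIB} MiB, fused {full_fused // MIB} MiB")
```

Under these assumptions each tensor is 8 MiB (4096 tiles), and fusing the inner expression cuts DRAM traffic from 48 MiB to 32 MiB. The counts ignore caching and any other effects of the real memory system.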

## Step 0 — TT-NN Baseline

**Script**: [`examples/elementwise-tutorial/step_0_ttnn_base.py`](https://github.com/tenstorrent/tt-lang/blob/main/examples/elementwise-tutorial/step_0_ttnn_base.py)

The starting point uses TT-NN directly, with no custom operation:

```python
y = ttnn.multiply(ttnn.add(ttnn.multiply(a, b), c), d)
```

Each call dispatches a separate operation and writes an intermediate tensor back
to DRAM. This is the reference we'll verify against as we build the custom
operation.
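
When comparing a custom operation against this baseline, exact equality is usually too strict for `bfloat16`. The sketch below emulates `bfloat16` rounding in plain Python to show how rounding after every operation drifts from a full-precision result. It is an illustration under stated assumptions (round-to-nearest-even, ordinary finite values), not a model of how the hardware or TT-NN rounds, and all function names are hypothetical.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    Handles ordinary finite values only; NaN, infinity, and overflow are
    out of scope for this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                   # round to nearest even
    bits &= 0xFFFF0000                                    # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def reference(a, b, c, d):
    """Result computed in Python floats (double precision)."""
    return (a * b + c) * d


def baseline_bf16(a, b, c, d):
    """Round after every operation, as when each intermediate is stored in bfloat16."""
    t = to_bf16(a * b)
    t = to_bf16(t + c)
    return to_bf16(t * d)


# Inputs are rounded first so that both paths start from bfloat16 values.
inputs = [to_bf16(v) for v in (1.5, 2.25, 0.3, 1.1)]
exact = reference(*inputs)
rounded = baseline_bf16(*inputs)
print(exact, rounded)

# Compare with a relative tolerance instead of exact equality.
print(math.isclose(exact, rounded, rel_tol=2**-6))
```

For these positive inputs, three roundings of at most about `2**-8` relative error each stay inside a `2**-6` tolerance. Inputs with cancellation (mixed signs) can need a looser or an absolute tolerance.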

## Step 1 — Single Node, Single-Tile Block

**Script**: [`examples/elementwise-tutorial/step_1_single_node_single_tile_block.py`](https://github.com/tenstorrent/tt-lang/blob/main/examples/elementwise-tutorial/step_1_single_node_single_tile_block.py)

This step introduces the complete TT-Lang programming model. The operation fuses
`a * b + c` into a single pass, processing one 32×32 tile at a time on one
node.

### Operation function and grid

An operation is a Python function decorated with `@ttl.operation()`. The `grid`
argument selects how many nodes (Tensix cores) to run on. `grid=(1, 1)` means
a single node.

```python
@ttl.operation(grid=(1, 1))
def __tutorial_operation(a: ttnn.Tensor, b: ttnn.Tensor, c: ttnn.Tensor, y: ttnn.Tensor):
    ...
```

The function arguments are the tensors the operation works on. They reside in
device DRAM and are passed in by the host at call time.

### Dataflow buffers

A *dataflow buffer* (DFB) is an L1 buffer shared between kernel functions
within a node. It is created once in the operation scope from a tensor likeness
and a block shape:

```python
a_dfb = ttl.make_dataflow_buffer_like(a, shape=(1, 1), buffer_factor=2)
```

`shape=(1, 1)` means each buffer entry holds one 32×32 tile. `buffer_factor=2`
allocates two entries in L1 so that the reader and compute kernels can work
concurrently — while compute processes one entry, the reader fills the other
(double-buffering).
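
To get a feel for how much L1 a dataflow buffer occupies, the following sketch computes the payload size from the block shape and `buffer_factor`. It assumes 2-byte `bfloat16` elements and counts only tile data, ignoring any alignment or bookkeeping the runtime may add. `dfb_bytes` is a hypothetical helper, not a TT-Lang function.

```python
TILE_SIZE = 32  # tiles are 32x32 elements


def dfb_bytes(shape, buffer_factor, bytes_per_elem=2):
    """Payload bytes of a DFB whose entries hold shape[0] x shape[1] tiles."""
    tiles_per_entry = shape[0] * shape[1]
    entry_bytes = tiles_per_entry * TILE_SIZE * TILE_SIZE * bytes_per_elem
    return entry_bytes * buffer_factor


per_dfb = dfb_bytes(shape=(1, 1), buffer_factor=2)
# The tutorial operation uses four DFBs: a_dfb, b_dfb, c_dfb, and y_dfb.
total = 4 * per_dfb

print(f"one DFB:   {per_dfb} bytes")
print(f"four DFBs: {total} bytes")
```

With single-tile entries and double-buffering, each DFB holds 4096 bytes and the four DFBs together hold 16 KiB. The footprint scales linearly with the number of tiles per entry and with `buffer_factor`.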

### Kernel functions

Three kernel functions run concurrently inside the operation:

```python
@ttl.compute()
def tutorial_compute(): ...

@ttl.datamovement()
def tutorial_read(): ...

@ttl.datamovement()
def tutorial_write(): ...
```

**Compute kernel** — waits for filled input blocks and reserves output blocks,
then runs the fused expression:

```python
with (
    a_dfb.wait() as a_blk,
    b_dfb.wait() as b_blk,
    c_dfb.wait() as c_blk,
    y_dfb.reserve() as y_blk,
):
    y_blk.store(a_blk * b_blk + c_blk)
```

`wait()` blocks until the reader has pushed a filled tile. `reserve()` blocks
until the writer has freed an entry. The `with` block automatically calls `pop()`
on inputs and `push()` on the output when the scope exits.
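
The scope-exit behaviour can be mimicked with ordinary Python context managers. The class below is a hypothetical, single-threaded toy, not the TT-Lang implementation: where the real `wait()` and `reserve()` would block, the toy raises an error, and an entry is modelled as a small dictionary.

```python
from collections import deque
from contextlib import contextmanager


class ToyDFB:
    """Hypothetical single-threaded model of a dataflow buffer (not the TT-Lang class)."""

    def __init__(self, buffer_factor: int):
        self.capacity = buffer_factor
        self.filled = deque()   # pushed entries, oldest first
        self.reserved = 0       # entries currently held by a producer

    @contextmanager
    def reserve(self):
        # The real reserve() blocks until an entry is free; the toy raises instead.
        if len(self.filled) + self.reserved >= self.capacity:
            raise RuntimeError("reserve() would block: no free entry")
        self.reserved += 1
        entry = {"data": None}
        yield entry
        self.reserved -= 1
        self.filled.append(entry)   # push() when the scope exits

    @contextmanager
    def wait(self):
        # The real wait() blocks until an entry is filled; the toy raises instead.
        if not self.filled:
            raise RuntimeError("wait() would block: no filled entry")
        yield self.filled[0]
        self.filled.popleft()       # pop() when the scope exits


dfb = ToyDFB(buffer_factor=2)

with dfb.reserve() as blk:      # producer side
    blk["data"] = 7
pushed = len(dfb.filled)        # the entry became visible on scope exit

with dfb.wait() as blk:         # consumer side
    value = blk["data"]
remaining = len(dfb.filled)     # the entry was freed on scope exit

print(pushed, value, remaining)
```

The producer's entry only becomes visible to the consumer once the `reserve()` scope closes, and the consumer's entry is only freed once the `wait()` scope closes, which is the pairing described above.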

**Reader DM kernel** — copies tiles from DRAM into the input DFBs:

```python
with (
    a_dfb.reserve() as a_blk,
    b_dfb.reserve() as b_blk,
    c_dfb.reserve() as c_blk,
):
    tx_a = ttl.copy(a[row, col], a_blk)
    tx_b = ttl.copy(b[row, col], b_blk)
    tx_c = ttl.copy(c[row, col], c_blk)
    tx_a.wait()
    tx_b.wait()
    tx_c.wait()
```

`ttl.copy` starts a transfer; `tx.wait()` waits for it to complete. The
index `a[row, col]` selects a tile in *tile coordinates* (not element
coordinates). The `with` block calls `push()` on exit, signalling the compute
kernel.
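
The relationship between tile coordinates and element coordinates is simple arithmetic. This small sketch, using a hypothetical helper and assuming 32×32 tiles as in the rest of the tutorial, maps a tile index to the element ranges it covers.

```python
TILE_SIZE = 32  # tiles are 32x32 elements


def tile_to_element_slices(row, col, tile_size=TILE_SIZE):
    """Return the (row, column) element slices covered by tile (row, col)."""
    return (
        slice(row * tile_size, (row + 1) * tile_size),
        slice(col * tile_size, (col + 1) * tile_size),
    )


row_slice, col_slice = tile_to_element_slices(2, 5)
print(row_slice, col_slice)
```

Tile `(2, 5)` covers element rows 64 to 95 and element columns 160 to 191.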

**Writer DM kernel** — copies computed output tiles from L1 back to DRAM:

```python
with y_dfb.wait() as y_blk:
    tx = ttl.copy(y_blk, y[row, col])
    tx.wait()
```
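
The three kernels form a producer-consumer pipeline. As a rough analogy only, the sketch below reproduces that structure with Python threads and bounded queues: a `queue.Queue` with `maxsize=2` stands in for a DFB with `buffer_factor=2`, scalars stand in for tiles, and a single input queue carrying a tuple replaces the three separate input DFBs. None of this is TT-Lang code.

```python
import queue
import threading


def run_pipeline(a, b, c, buffer_factor=2):
    """Run a toy read -> compute -> write pipeline over lists of scalars."""
    n = len(a)
    in_dfb = queue.Queue(maxsize=buffer_factor)    # reader -> compute
    out_dfb = queue.Queue(maxsize=buffer_factor)   # compute -> writer
    y = [None] * n

    def read():
        for i in range(n):
            # put() blocks while all entries are full, like reserve().
            in_dfb.put((a[i], b[i], c[i]))

    def compute():
        for _ in range(n):
            # get() blocks until an entry is available, like wait().
            a_blk, b_blk, c_blk = in_dfb.get()
            out_dfb.put(a_blk * b_blk + c_blk)

    def write():
        for i in range(n):
            y[i] = out_dfb.get()

    threads = [threading.Thread(target=fn) for fn in (read, compute, write)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return y


a = [float(i) for i in range(8)]
b = [2.0] * 8
c = [1.0] * 8
result = run_pipeline(a, b, c)
print(result)
```

Each stage runs the same number of iterations, and the bounded queues keep the reader at most two entries ahead of compute. That is the role the dataflow buffers play between the kernel functions.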

## Step 2 — Single Node, Multi-Tile Block

**Script**: [`examples/elementwise-tutorial/step_2_single_node_multitile_block.py`](https://github.com/tenstorrent/tt-lang/blob/main/examples/elementwise-tutorial/step_2_single_node_multitile_block.py)

Processing one tile at a time incurs one dataflow-buffer synchronization
round-trip per tile. This step groups tiles into larger blocks so that each
transfer and compute iteration covers a `GRANULARITY × GRANULARITY` patch of tiles.

```python
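GRANULARITY = 4  # each block is a 4×4 patch of 32×32 tiles = 128×128 elements
row_tiles_per_block = GRANULARITY  # assumed: square blocks, as described above
col_tiles_per_block = GRANULARITY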

a_dfb = ttl.make_dataflow_buffer_like(
    a, shape=(row_tiles_per_block, col_tiles_per_block), buffer_factor=2
)
```

The iteration counts change from individual tiles to blocks:

```python
rows = a.shape[0] // TILE_SIZE // row_tiles_per_block
cols = a.shape[1] // TILE_SIZE // col_tiles_per_block
```
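
To see how block size affects the number of iterations, and therefore the number of synchronization round-trips, the sketch below evaluates the same formulas for a few granularities on the 2048×2048 tensor. It is plain Python with a hypothetical helper name, and it assumes square blocks.

```python
TILE_SIZE = 32  # tiles are 32x32 elements


def block_grid(shape, row_tiles_per_block, col_tiles_per_block):
    """Number of block rows and block columns for a tensor of the given shape."""
    rows = shape[0] // TILE_SIZE // row_tiles_per_block
    cols = shape[1] // TILE_SIZE // col_tiles_per_block
    return rows, cols


iterations = {}
for granularity in (1, 2, 4, 8):
    rows, cols = block_grid((2048, 2048), granularity, granularity)
    iterations[granularity] = rows * cols
    print(f"granularity {granularity}: {rows} x {cols} = {rows * cols} iterations")
```

Going from single tiles to 4×4 blocks reduces the iteration count from 4096 to 256. The trade-off is that each buffer entry is 16 times larger.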

The reader selects a tile range (not a single tile) per transfer:

```python
tx_a = ttl.copy(
    a[start_row_tile:end_row_tile, start_col_tile:end_col_tile],
    a_blk,
)
```
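
One plausible way to derive the slice bounds from a block index is shown below. The formula is an assumption about how `start_row_tile` and `end_row_tile` are computed (the tutorial script may differ), and `block_tile_range` is a hypothetical helper.

```python
def block_tile_range(block_index, tiles_per_block):
    """Half-open tile range [start, end) covered by one block along one axis."""
    start = block_index * tiles_per_block
    return start, start + tiles_per_block


row_tiles_per_block = 4

start_row_tile, end_row_tile = block_tile_range(3, row_tiles_per_block)
print(start_row_tile, end_row_tile)

# 16 blocks of 4 tiles tile the 64 tile rows of a 2048-row tensor without gaps.
covered = []
for block in range(16):
    start, end = block_tile_range(block, row_tiles_per_block)
    covered.extend(range(start, end))
print(covered == list(range(64)))
```

Block 3 covers tile rows 12 to 15. Because the ranges are half-open, consecutive blocks neither overlap nor leave gaps.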

The operation structure, synchronization pattern, and compute expression are
unchanged from Step 1.

## Step 3 — Multi-Node, Fixed Grid

**Script**: [`examples/elementwise-tutorial/step_3_multinode.py`](https://github.com/tenstorrent/tt-lang/blob/main/examples/elementwise-tutorial/step_3_multinode.py)

This step parallelizes the operation across a 4×4 grid of nodes. Each node
processes an independent rectangular region of the tensor. For background on
the Tenstorrent hardware architecture, we recommend reading the
[TT Architecture and Metalium Guide](https://github.com/tenstorrent/tt-metal/blob/main/METALIUM_GUIDE.md).

### Declaring a multi-node grid

```python
@ttl.operation(grid=(4, 4))
def __tutorial_operation(a: ttnn.Tensor, b: ttnn.Tensor, c: ttnn.Tensor, y: ttnn.Tensor):
    ...
```

All nodes execute the same operation body. They differentiate their work using
their coordinates in the grid as explained in the next sections.

### Querying grid size and node position

`ttl.grid_size(dims=2)` returns `(cols, rows)` — the number of nodes along
each grid dimension. `ttl.node(dims=2)` returns the `(col, row)` coordinates
of the current node, zero-based.

```python
grid_cols, grid_rows = ttl.grid_size(dims=2)

rows_per_node = a.shape[0] // TILE_SIZE // row_tiles_per_block // grid_rows
cols_per_node = a.shape[1] // TILE_SIZE // col_tiles_per_block // grid_cols
```
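
Plugging in the tutorial's numbers makes the per-node workload concrete. The sketch below is plain Python. It takes the grid as a `(cols, rows)` pair to mirror the ordering described above and assumes 4×4-tile blocks as in Step 2. The function name is hypothetical.

```python
TILE_SIZE = 32  # tiles are 32x32 elements


def blocks_per_node(shape, row_tiles_per_block, col_tiles_per_block, grid):
    """Blocks handled by each node along rows and columns; grid is (cols, rows)."""
    grid_cols, grid_rows = grid
    rows_per_node = shape[0] // TILE_SIZE // row_tiles_per_block // grid_rows
    cols_per_node = shape[1] // TILE_SIZE // col_tiles_per_block // grid_cols
    return rows_per_node, cols_per_node


rows_per_node, cols_per_node = blocks_per_node((2048, 2048), 4, 4, grid=(4, 4))
region_rows = rows_per_node * 4 * TILE_SIZE   # element rows per node
region_cols = cols_per_node * 4 * TILE_SIZE   # element columns per node

print(f"{rows_per_node} x {cols_per_node} blocks per node")
print(f"{region_rows} x {region_cols} elements per node")
```

On a 4×4 grid each node handles 4×4 blocks, that is, a 512×512 element region of the 2048×2048 tensor.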

### Mapping local to global indices

Each DM kernel uses its node coordinates to offset into the global tensor:

```python
node_col, node_row = ttl.node(dims=2)

for local_row in range(rows_per_node):
    row = node_row * rows_per_node + local_row
    for local_col in range(cols_per_node):
        col = node_col * cols_per_node + local_col
        ...
```

In this particular example, the compute kernel is unaware of node coordinates — it
simply processes all blocks that the DM kernels deliver to it.

This version requires the tensor dimensions to be evenly divisible by the grid.
See Step 4 for a version that handles arbitrary sizes.
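
The divisibility requirement can be checked with a small host-side simulation. The sketch below replays the index mapping for every node in plain Python and counts how often each block is visited. It is not TT-Lang code, and the function name is hypothetical.

```python
from collections import Counter


def assigned_blocks(rows, cols, grid_rows, grid_cols):
    """Count visits per block when work is split with floor division."""
    rows_per_node = rows // grid_rows
    cols_per_node = cols // grid_cols
    counts = Counter()
    for node_row in range(grid_rows):
        for node_col in range(grid_cols):
            for local_row in range(rows_per_node):
                row = node_row * rows_per_node + local_row
                for local_col in range(cols_per_node):
                    col = node_col * cols_per_node + local_col
                    counts[(row, col)] += 1
    return counts


even = assigned_blocks(16, 16, grid_rows=4, grid_cols=4)
uneven = assigned_blocks(18, 18, grid_rows=4, grid_cols=4)

print(len(even), "of", 16 * 16, "blocks covered")
print(len(uneven), "of", 18 * 18, "blocks covered")
```

With 16×16 blocks every block is visited exactly once. With 18×18 blocks the same mapping covers only 256 of 324 blocks, so 68 blocks at the trailing edge are never processed.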

## Step 4 — Multi-Node, Auto Grid

**Script**: [`examples/elementwise-tutorial/step_4_multinode_grid_auto.py`](https://github.com/tenstorrent/tt-lang/blob/main/examples/elementwise-tutorial/step_4_multinode_grid_auto.py)

This step removes two constraints from Step 3: the hard-coded grid size and
the requirement for even divisibility.

### Auto grid

```python
@ttl.operation(grid="auto")
```

`grid="auto"` lets the compiler select the largest grid that fits the available
hardware resources. The operation must therefore work correctly for any grid the
compiler may choose, which the following subsections address.

### Ceiling division

When the number of blocks does not divide evenly across the grid, floor
division would leave the blocks at the trailing edge unassigned. Ceiling
division ensures every block is assigned to some node:

```python
rows_per_node = -(-rows // grid_rows)  # ceil(rows / grid_rows)
cols_per_node = -(-cols // grid_cols)  # ceil(cols / grid_cols)
```
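
The `-(-n // d)` idiom relies on Python's floor division rounding toward negative infinity. The following check, a standalone sketch with a hypothetical `ceil_div` helper, confirms that it matches `math.ceil` for a range of small positive values.

```python
import math


def ceil_div(n: int, d: int) -> int:
    """Integer ceiling division using only floor division."""
    return -(-n // d)


mismatches = [
    (n, d)
    for n in range(0, 200)
    for d in range(1, 17)
    if ceil_div(n, d) != math.ceil(n / d)
]

print(ceil_div(18, 4), ceil_div(16, 4), mismatches)
```

For non-negative `n` and positive `d` the expression `(n + d - 1) // d` gives the same result. Both forms stay in integer arithmetic and avoid the floating-point division inside `math.ceil(n / d)`.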

### Bounds checking

Nodes at the trailing edge may be assigned more iterations than there are
actual blocks. All three kernel functions guard per-block work:

```python
for local_row in range(rows_per_node):
    row = node_row * rows_per_node + local_row
    if row < rows:          # skip if past the end of the tensor
        for local_col in range(cols_per_node):
            col = node_col * cols_per_node + local_col
            if col < cols:  # skip if past the end of the tensor
                ...
```

The guard must appear in every kernel function — compute, read, and write —
so that they all agree on exactly which blocks to process.
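
To confirm that ceiling division plus the guard covers every block exactly once, the loop nest can be replayed on the host for each node position. This is a plain Python simulation under the same index mapping, not TT-Lang code, and the function name is hypothetical.

```python
from collections import Counter


def guarded_blocks(rows, cols, grid_rows, grid_cols):
    """Count visits per block with ceiling division and bounds checks."""
    rows_per_node = -(-rows // grid_rows)   # ceil(rows / grid_rows)
    cols_per_node = -(-cols // grid_cols)   # ceil(cols / grid_cols)
    counts = Counter()
    for node_row in range(grid_rows):
        for node_col in range(grid_cols):
            for local_row in range(rows_per_node):
                row = node_row * rows_per_node + local_row
                if row < rows:              # skip if past the end of the tensor
                    for local_col in range(cols_per_node):
                        col = node_col * cols_per_node + local_col
                        if col < cols:      # skip if past the end of the tensor
                            counts[(row, col)] += 1
    return counts


for rows, cols, grid_rows, grid_cols in [(18, 18, 4, 4), (7, 3, 8, 8), (16, 16, 4, 4)]:
    counts = guarded_blocks(rows, cols, grid_rows, grid_cols)
    complete = len(counts) == rows * cols and set(counts.values()) == {1}
    print(f"{rows}x{cols} blocks on {grid_rows}x{grid_cols} grid: complete={complete}")
```

The second case has more nodes than blocks along both axes: some nodes simply do no work, which is why the operation stays correct for whatever grid is chosen.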