Matrix Multiplication

This tutorial demonstrates how to perform matrix multiplication operations using TT-NN, showcasing different memory configurations and layout conversions. We’ll explore how to create random tensors on device, perform matrix multiplication, and configure operations for optimal performance on Tenstorrent hardware.

Import Libraries

[ ]:
import ttnn

Open the Device

Create a device to run our matrix multiplication operations. Device ID 0 typically refers to the first available Tenstorrent accelerator.

[ ]:
device_id = 0
device = ttnn.open_device(device_id=device_id)

Tensor Configuration

Set up dimensions for our matrix multiplication: A(m×k) × B(k×n) = C(m×n). This example uses 1024×1024 matrices, i.e. each dimension spans 32 tiles of 32×32 elements.

[ ]:
m = 1024  # Number of rows in matrix A and result
k = 1024  # Number of columns in A / rows in B (must match for valid matmul)
n = 1024  # Number of columns in matrix B and result
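As a sanity check, the tile arithmetic above can be verified in plain Python (no hardware needed). The tile edge length of 32 is taken from the text; the variable names below are illustrative, not part of the TT-NN API:

```python
# Tensix cores operate on 32x32 tiles, so each matmul dimension
# should be a multiple of the tile edge length.
TILE = 32  # tile edge length used by TILE_LAYOUT

m, k, n = 1024, 1024, 1024

for name, dim in {"m": m, "k": k, "n": n}.items():
    assert dim % TILE == 0, f"{name}={dim} is not a multiple of {TILE}"

tiles_m, tiles_k, tiles_n = m // TILE, k // TILE, n // TILE
# Each 1024x1024 operand is a 32x32 grid of tiles.
print(f"A: {tiles_m}x{tiles_k} tiles, B: {tiles_k}x{tiles_n} tiles, C: {tiles_m}x{tiles_n} tiles")
```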

Initialize tensors a and b with random values

Create random tensors directly on the device using TILE_LAYOUT, which is optimized for Tensix cores that operate on 32×32 tiles. We use bfloat16 for efficient computation with a good numerical range.

[ ]:
a = ttnn.rand((m, k), dtype=ttnn.bfloat16, device=device, layout=ttnn.TILE_LAYOUT)
b = ttnn.rand((k, n), dtype=ttnn.bfloat16, device=device, layout=ttnn.TILE_LAYOUT)

Matrix multiply tensors a and b

Perform matrix multiplication using the @ operator. This is equivalent to ttnn.matmul with default settings.

The first run takes longer because the kernels need to be compiled.

[ ]:
output = a @ b

Re-running the operation is significantly faster because the compiled program is reused from the program cache.

[ ]:
output = a @ b
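One way to see the effect of the program cache is to time both runs. The helper below is a minimal, hardware-agnostic sketch; the `time_op` name is illustrative, and the commented lines assume the `a` and `b` tensors from the earlier cells:

```python
import time

def time_op(fn, label):
    """Run fn once and report the wall-clock time taken."""
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.2f} ms")
    return elapsed

# On device, the pattern would look like:
#   first  = time_op(lambda: a @ b, "first run (includes kernel compile)")
#   cached = time_op(lambda: a @ b, "cached run")
# Cheap CPU stand-in so this sketch runs anywhere:
first = time_op(lambda: sum(range(1_000_000)), "demo run")
```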

Inspect the layout of matrix multiplication output

Print the current layout of the output tensor.

[ ]:
print(output.layout)

As can be seen, matrix multiplication produces its output in tile layout, because computing matrix multiplications on Tenstorrent accelerators is much more efficient in this layout than in a row-major layout.

This is also why the logs show two tilize operations when the inputs are in a row-major layout: they get automatically converted to tile layout first.

Learn more about tile layout here
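Tile layout also implies padding: dimensions that are not multiples of the 32-element tile edge are rounded up to the next tile boundary before tiling. A quick back-of-the-envelope calculation in plain Python (the `padded_dim` helper is illustrative, not a TT-NN API):

```python
TILE = 32  # tile edge length, per the discussion above

def padded_dim(dim, tile=TILE):
    """Round dim up to the next multiple of the tile edge length."""
    return ((dim + tile - 1) // tile) * tile

# A 1000x1000 row-major tensor would be padded to 1024x1024 when tilized.
print(padded_dim(1000))  # -> 1024: next multiple of 32 above 1000
print(padded_dim(1024))  # -> 1024: already tile-aligned, unchanged
```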

Inspect the result of the matrix multiplication

To inspect the results we will first convert to row-major layout.

[ ]:
output = ttnn.to_layout(output, ttnn.ROW_MAJOR_LAYOUT)

print("Printing ttnn tensor")
print(f"shape: {output.shape}")
print(f"chunk of a tensor:\n{output[:1, :32]}")

Matrix multiply tensors a and b using a more performant configuration

By default, matrix multiplication might not be as efficient as it could be. To speed it up, you can specify how many cores the matrix multiplication should use, which can improve performance significantly.

[ ]:
a = ttnn.rand((m, k), dtype=ttnn.bfloat16, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)
b = ttnn.rand((k, n), dtype=ttnn.bfloat16, device=device, layout=ttnn.TILE_LAYOUT, memory_config=ttnn.L1_MEMORY_CONFIG)

Run once to compile the kernels

[ ]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))

Enjoy a massive speed-up on subsequent runs, thanks to the program cache

[ ]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))
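As a rough mental model (a plain-Python sketch, not TT-NN's actual scheduling policy), the 8×8 core grid gives 64 Tensix cores to share the output tiles:

```python
TILE = 32
m, n = 1024, 1024
grid_y, grid_x = 8, 8  # core_grid=ttnn.CoreGrid(y=8, x=8) from the cell above

output_tiles = (m // TILE) * (n // TILE)  # 32 * 32 = 1024 output tiles
cores = grid_y * grid_x                   # 64 cores in the grid
tiles_per_core = output_tiles // cores    # even split in this idealized model
print(f"{output_tiles} output tiles over {cores} cores -> {tiles_per_core} tiles/core")
```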

Close the device

[ ]:
ttnn.close_device(device)

Full Example and Output

Let's put everything together in a complete example that can be run directly.

ttnn_basic_matrix_multiplication.py

Running this script will generate the following output:

$ python3 $TT_METAL_HOME/ttnn/tutorials/basic_python/ttnn_basic_matrix_multiplication.py
2025-10-23 09:03:21.386 | info     |          Device | Opening user mode device driver (tt_cluster.cpp:209)
2025-10-23 09:03:21.512 | info     |             UMD | Harvesting mask for chip 0 is 0x20 (NOC0: 0x20, simulated harvesting mask: 0x0). (cluster.cpp:394)
2025-10-23 09:03:21.751 | info     |             UMD | Opening local chip ids/PCIe ids: {0}/[2] and remote chip ids {} (cluster.cpp:252)
2025-10-23 09:03:21.751 | info     |             UMD | All devices in cluster running firmware version: 18.10.0 (cluster.cpp:232)
2025-10-23 09:03:21.751 | info     |             UMD | IOMMU: disabled (cluster.cpp:174)
2025-10-23 09:03:21.751 | info     |             UMD | KMD version: 2.4.0 (cluster.cpp:177)
2025-10-23 09:03:21.752 | info     |             UMD | Software version 6.0.0, Ethernet FW version 7.0.0 (Device 0) (cluster.cpp:1085)
2025-10-23 09:03:21.765 | info     |             UMD | Pinning pages for Hugepage: virtual address 0x7f5480000000 and size 0x40000000 pinned to physical address 0x4c0000000 (pci_device.cpp:536)
Layout.TILE
Printing ttnn tensor
shape: Shape([1024, 1024])
chunk of a tensor:
ttnn.Tensor([[258.0000, 260.0000,  ..., 266.0000, 272.0000]], shape=Shape([1, 32]), dtype=DataType::BFLOAT16, layout=Layout::ROW_MAJOR)
2025-10-23 09:03:46.028 | info     |          Device | Closing user mode device drivers (tt_cluster.cpp:426)