Matrix Multiplication

[1]:
import torch
import ttnn

torch.manual_seed(0)

device_id = 0
device = ttnn.open_device(device_id=device_id)
                  Metal | INFO     | Initializing device 0
                 Device | INFO     | Opening user mode device driver
2024-01-29 23:02:46.902 | INFO     | SiliconDriver   - Detected 1 PCI device : {0}
2024-01-29 23:02:46.913 | WARNING  | SiliconDriver   - init_detect_tt_device_numanodes(): Could not determine NumaNodeSet for TT device (physical_device_id: 0 pci_bus_id: 0000:00:08.0)
2024-01-29 23:02:46.913 | WARNING  | SiliconDriver   - Could not find NumaNodeSet for TT Device (physical_device_id: 0 pci_bus_id: 0000:00:08.0)
2024-01-29 23:02:46.915 | WARNING  | SiliconDriver   - bind_area_memory_nodeset(): Unable to determine TT Device to NumaNode mapping for physical_device_id: 0. Skipping membind.
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
                  Metal | INFO     | AI CLK for device 0 is:   1202 MHz

Enable program cache

Enabling the program cache speeds up operations that run repeatedly, because compiled programs are reused instead of being recompiled on every call

[2]:
ttnn.enable_program_cache(device)
                     Op | INFO     | Program Cache: enabled.

Configuration

[3]:
m = 1024
k = 1024
n = 1024

Initialize tensors a and b with random values using torch

[4]:
torch_a = torch.randn((m, k), dtype=torch.bfloat16)
torch_b = torch.randn((k, n), dtype=torch.bfloat16)
[5]:
a = ttnn.from_torch(torch_a, layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch_b, layout=ttnn.TILE_LAYOUT, device=device)
                     Op | INFO     | Finished Operation ttnn.from_torch                                    in          175489 nanoseconds
                     Op | INFO     | Finished Operation ttnn.to_device                                     in          326608 nanoseconds
                     Op | INFO     | Finished Operation ttnn.from_torch                                    in           47769 nanoseconds
                     Op | INFO     | Finished Operation ttnn.to_device                                     in          165459 nanoseconds

Matrix multiply tensors a and b

The first run of the operation takes longer because the kernels have to be compiled

[6]:
output = a @ b
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           38930 nanoseconds
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           35890 nanoseconds
                     Op | INFO     | Finished Program   tt::tt_metal::Matmul                               in       576872807 nanoseconds
                     Op | INFO     | Finished Operation tt::tt_metal::Matmul                               in       577071926 nanoseconds
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           99419 nanoseconds

Re-running the operation shows a significant speed-up thanks to the program cache

[7]:
output = a @ b
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           39200 nanoseconds
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           22440 nanoseconds
                     Op | INFO     | Finished Program   tt::tt_metal::Matmul                               in         1183694 nanoseconds
                     Op | INFO     | Finished Operation tt::tt_metal::Matmul                               in         1224093 nanoseconds
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           64480 nanoseconds
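
If you want to confirm the speed-up from Python rather than by reading the op logs, a simple wall-clock timing sketch works (assuming the matmul call blocks until the result is ready, as the logs above suggest):

import time

start = time.perf_counter()
output = a @ b  # the program is already cached at this point
print(f"cached matmul took {(time.perf_counter() - start) * 1e3:.2f} ms")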

Inspect the layout of matrix multiplication output

[8]:
print(output.layout)
Layout.TILE

As can be seen, matrix multiplication produces its output in tile layout. That is because computing matrix multiplications on Tenstorrent accelerators is much more efficient in tile layout than in row-major layout.

This is also why tilize operations appear in the logs whenever the inputs arrive in a row-major layout: they are automatically converted to tile layout before the multiplication runs.

Learn more about tile layout here: TODO
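
If you prefer to avoid the implicit conversion, the inputs can be tilized explicitly before calling matmul. A minimal sketch using the same ttnn.to_layout call that appears later in this tutorial (the variable names are just for illustration):

# start from a row-major tensor on the device
a_row_major = ttnn.from_torch(torch_a, layout=ttnn.ROW_MAJOR_LAYOUT, device=device)

# convert it to tile layout up front so matmul does not have to do it implicitly
a_tiled = ttnn.to_layout(a_row_major, ttnn.TILE_LAYOUT)
print(a_tiled.layout)  # Layout.TILE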

Inspect the result of the matrix multiplication

To inspect the results, we will first convert the output to row-major layout.

[9]:
output = ttnn.to_layout(output, ttnn.ROW_MAJOR_LAYOUT)

print("Printing ttnn tensor")
print(f"shape: {output.shape}")
print(f"chunk of a tensor:\n{output[:1, :32]}")
                     Op | INFO     | Finished Program   tt::tt_metal::Untilize                             in       508667002 nanoseconds
Printing ttnn tensor
shape: ttnn.Shape([1024, 1024])

                     Op | INFO     | Finished Operation tt::tt_metal::Untilize                             in       508783061 nanoseconds
                     Op | INFO     | Finished Operation ttnn.from_device                                   in         1352602 nanoseconds
                     Op | INFO     | Finished Operation ttnn.to_torch                                      in         1744890 nanoseconds
chunk of a tensor:
Tensor([ [34.25, 9.625, 11.3125, 0.964844, 1.45312, -26.875, 23.125, -1.39062, -20.375, 33, 5.8125, 10.6875, -18.625, 14.5, -42.75, -18.375, 27.75, 44.25, -27.25, -20.5, 43.5, -5.75, -46.75, -45.75, 43.75, 33, -16.125, 39.25, 11.6875, 9.4375, -39.75, -6.5625]], dtype=bfloat16 )

                     Op | INFO     | Finished Operation torch.Tensor.__getitem__                           in          711456 nanoseconds
                     Op | INFO     | Finished Operation ttnn.from_torch                                    in          123629 nanoseconds
                     Op | INFO     | Finished Operation ttnn.to_device                                     in          190228 nanoseconds
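
As a quick sanity check, the result can be compared against torch's own matmul on the host. This is only a sketch: bfloat16 accumulation on the device will not match torch bit for bit, so it prints the maximum absolute difference instead of asserting a tolerance:

# reference result computed on the host
torch_output = torch_a @ torch_b

# bring the ttnn result back to a torch tensor for comparison
ttnn_output = ttnn.to_torch(output)

max_abs_err = (ttnn_output.float() - torch_output.float()).abs().max().item()
print(f"max absolute difference vs torch: {max_abs_err}")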

Matrix multiply tensors a and b using a more performant config

By default, matrix multiplication might not be as efficient as it could be. To speed it up, you can tell the operation how many cores to use via the core_grid argument, which can make a significant difference.

[10]:
a = ttnn.from_torch(torch_a)
b = ttnn.from_torch(torch_b)

a = ttnn.to_device(a, device, memory_config=ttnn.L1_MEMORY_CONFIG)
b = ttnn.to_device(b, device, memory_config=ttnn.L1_MEMORY_CONFIG)

a = ttnn.to_layout(a, ttnn.TILE_LAYOUT)
b = ttnn.to_layout(b, ttnn.TILE_LAYOUT)
                     Op | INFO     | Finished Operation ttnn.from_torch                                    in           46380 nanoseconds
                     Op | INFO     | Finished Operation ttnn.from_torch                                    in           33729 nanoseconds
                     Op | INFO     | Finished Operation ttnn.to_device                                     in         1330892 nanoseconds
                     Op | INFO     | Finished Operation ttnn.to_device                                     in         1996019 nanoseconds
                     Op | INFO     | Finished Program   tt::tt_metal::Tilize                               in       556706140 nanoseconds
                     Op | INFO     | Finished Operation tt::tt_metal::Tilize                               in       556884870 nanoseconds
                     Op | INFO     | Finished Program   tt::tt_metal::Tilize                               in          424187 nanoseconds
                     Op | INFO     | Finished Operation tt::tt_metal::Tilize                               in          473467 nanoseconds

Run once to compile the kernels

[11]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))
                     Op | INFO     | Finished Operation ttnn.reshape                                       in          116419 nanoseconds
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           27450 nanoseconds
                     Op | INFO     | Finished Program   tt::operations::primary::Matmul                    in       652476970 nanoseconds
                     Op | INFO     | Finished Operation tt::operations::primary::Matmul                    in       652929758 nanoseconds
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           86579 nanoseconds

Enjoy a massive speed-up on subsequent runs

[12]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           38110 nanoseconds
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           24079 nanoseconds
                     Op | INFO     | Finished Program   tt::operations::primary::Matmul                    in          129909 nanoseconds
                     Op | INFO     | Finished Operation tt::operations::primary::Matmul                    in          164599 nanoseconds
                     Op | INFO     | Finished Operation ttnn.reshape                                       in           24209 nanoseconds
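
To put a number on that speed-up, the cached, configured matmul can be wall-clock timed the same way as the default path earlier (again assuming the call blocks until completion):

import time

start = time.perf_counter()
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))
print(f"configured matmul took {(time.perf_counter() - start) * 1e3:.2f} ms")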

Close the device

[13]:
ttnn.close_device(device)
                  Metal | INFO     | Closing device 0