Matrix Multiplication
[1]:
import torch
import ttnn
torch.manual_seed(0)
device_id = 0
device = ttnn.open_device(device_id=device_id)
Metal | INFO | Initializing device 0
Device | INFO | Opening user mode device driver
2024-01-29 23:02:46.902 | INFO | SiliconDriver - Detected 1 PCI device : {0}
2024-01-29 23:02:46.913 | WARNING | SiliconDriver - init_detect_tt_device_numanodes(): Could not determine NumaNodeSet for TT device (physical_device_id: 0 pci_bus_id: 0000:00:08.0)
2024-01-29 23:02:46.913 | WARNING | SiliconDriver - Could not find NumaNodeSet for TT Device (physical_device_id: 0 pci_bus_id: 0000:00:08.0)
2024-01-29 23:02:46.915 | WARNING | SiliconDriver - bind_area_memory_nodeset(): Unable to determine TT Device to NumaNode mapping for physical_device_id: 0. Skipping membind.
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
Metal | INFO | AI CLK for device 0 is: 1202 MHz
Enable program cache
Enabling the program cache will speed up the execution of operations that run repeatedly
[2]:
ttnn.enable_program_cache(device)
Op | INFO | Program Cache: enabled.
Configuration
[3]:
m = 1024
k = 1024
n = 1024
Initialize tensors a and b with random values using torch
[4]:
torch_a = torch.randn((m, k), dtype=torch.bfloat16)
torch_b = torch.randn((k, n), dtype=torch.bfloat16)
[5]:
a = ttnn.from_torch(torch_a, layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch_b, layout=ttnn.TILE_LAYOUT, device=device)
Op | INFO | Finished Operation ttnn.from_torch in 175489 nanoseconds
Op | INFO | Finished Operation ttnn.to_device in 326608 nanoseconds
Op | INFO | Finished Operation ttnn.from_torch in 47769 nanoseconds
Op | INFO | Finished Operation ttnn.to_device in 165459 nanoseconds
Matrix multiply tensors a and b
The operation will take longer the first time it runs because the kernels need to be compiled
[6]:
output = a @ b
Op | INFO | Finished Operation ttnn.reshape in 38930 nanoseconds
Op | INFO | Finished Operation ttnn.reshape in 35890 nanoseconds
Op | INFO | Finished Program tt::tt_metal::Matmul in 576872807 nanoseconds
Op | INFO | Finished Operation tt::tt_metal::Matmul in 577071926 nanoseconds
Op | INFO | Finished Operation ttnn.reshape in 99419 nanoseconds
Re-running the operation shows a significant speed-up thanks to program caching
[7]:
output = a @ b
Op | INFO | Finished Operation ttnn.reshape in 39200 nanoseconds
Op | INFO | Finished Operation ttnn.reshape in 22440 nanoseconds
Op | INFO | Finished Program tt::tt_metal::Matmul in 1183694 nanoseconds
Op | INFO | Finished Operation tt::tt_metal::Matmul in 1224093 nanoseconds
Op | INFO | Finished Operation ttnn.reshape in 64480 nanoseconds
Inspect the layout of the matrix multiplication output
[8]:
print(output.layout)
Layout.TILE
As can be seen, matrix multiplication produces its output in a tile layout. That is because computing matrix multiplications on Tenstorrent accelerators is much more efficient in this layout than in a row-major layout.
This is also why tilize operations show up in the logs whenever the inputs are in a row-major layout: such inputs are automatically converted to the tile layout first.
Learn more about tile layout here: TODO
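As a quick illustration, here is a minimal sketch of converting between the two layouts with ttnn.to_layout (the same call used later in this tutorial); the 32x32 shape is an illustrative choice because it matches a single tile:
[ ]:
# Minimal sketch: round-trip a small tensor between row-major and tile layout.
# ttnn.from_torch defaults to row-major when no layout is given.
rm = ttnn.from_torch(torch.randn((32, 32), dtype=torch.bfloat16), device=device)
print(rm.layout)     # Layout.ROW_MAJOR
tiled = ttnn.to_layout(rm, ttnn.TILE_LAYOUT)
print(tiled.layout)  # Layout.TILE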
Inspect the result of the matrix multiplication
To inspect the results, we will first convert the tensor to row-major layout.
[9]:
output = ttnn.to_layout(output, ttnn.ROW_MAJOR_LAYOUT)
print("Printing ttnn tensor")
print(f"shape: {output.shape}")
print(f"chunk of a tensor:\n{output[:1, :32]}")
Op | INFO | Finished Program tt::tt_metal::Untilize in 508667002 nanoseconds
Printing ttnn tensor
shape: ttnn.Shape([1024, 1024])
Op | INFO | Finished Operation tt::tt_metal::Untilize in 508783061 nanoseconds
Op | INFO | Finished Operation ttnn.from_device in 1352602 nanoseconds
Op | INFO | Finished Operation ttnn.to_torch in 1744890 nanoseconds
chunk of a tensor:
Tensor([ [34.25, 9.625, 11.3125, 0.964844, 1.45312, -26.875, 23.125, -1.39062, -20.375, 33, 5.8125, 10.6875, -18.625, 14.5, -42.75, -18.375, 27.75, 44.25, -27.25, -20.5, 43.5, -5.75, -46.75, -45.75, 43.75, 33, -16.125, 39.25, 11.6875, 9.4375, -39.75, -6.5625]], dtype=bfloat16 )
Op | INFO | Finished Operation torch.Tensor.__getitem__ in 711456 nanoseconds
Op | INFO | Finished Operation ttnn.from_torch in 123629 nanoseconds
Op | INFO | Finished Operation ttnn.to_device in 190228 nanoseconds
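As a sanity check, the device result can also be compared against a host-side torch reference (a minimal sketch; the tolerances are illustrative assumptions for bfloat16 accumulation over k=1024, not library-recommended values):
[ ]:
# Sketch: compare the ttnn result with torch.matmul on the host.
# The tolerances below are illustrative assumptions for bfloat16 math.
torch_output = ttnn.to_torch(output)
torch_reference = torch_a.float() @ torch_b.float()
print(torch.allclose(torch_output.float(), torch_reference, rtol=0.05, atol=2.0))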
Matrix multiply tensors a and b using a more performant configuration
By default, matrix multiplication might not be as efficient as it could be. To speed it up, the user can specify how many cores the operation should run on, which can make it significantly faster.
[10]:
a = ttnn.from_torch(torch_a)
b = ttnn.from_torch(torch_b)
a = ttnn.to_device(a, device, memory_config=ttnn.L1_MEMORY_CONFIG)
b = ttnn.to_device(b, device, memory_config=ttnn.L1_MEMORY_CONFIG)
a = ttnn.to_layout(a, ttnn.TILE_LAYOUT)
b = ttnn.to_layout(b, ttnn.TILE_LAYOUT)
Op | INFO | Finished Operation ttnn.from_torch in 46380 nanoseconds
Op | INFO | Finished Operation ttnn.from_torch in 33729 nanoseconds
Op | INFO | Finished Operation ttnn.to_device in 1330892 nanoseconds
Op | INFO | Finished Operation ttnn.to_device in 1996019 nanoseconds
Op | INFO | Finished Program tt::tt_metal::Tilize in 556706140 nanoseconds
Op | INFO | Finished Operation tt::tt_metal::Tilize in 556884870 nanoseconds
Op | INFO | Finished Program tt::tt_metal::Tilize in 424187 nanoseconds
Op | INFO | Finished Operation tt::tt_metal::Tilize in 473467 nanoseconds
Run once to compile the kernels
[11]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))
Op | INFO | Finished Operation ttnn.reshape in 116419 nanoseconds
Op | INFO | Finished Operation ttnn.reshape in 27450 nanoseconds
Op | INFO | Finished Program tt::operations::primary::Matmul in 652476970 nanoseconds
Op | INFO | Finished Operation tt::operations::primary::Matmul in 652929758 nanoseconds
Op | INFO | Finished Operation ttnn.reshape in 86579 nanoseconds
Enjoy a massive speed-up on subsequent runs
[12]:
output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))
Op | INFO | Finished Operation ttnn.reshape in 38110 nanoseconds
Op | INFO | Finished Operation ttnn.reshape in 24079 nanoseconds
Op | INFO | Finished Program tt::operations::primary::Matmul in 129909 nanoseconds
Op | INFO | Finished Operation tt::operations::primary::Matmul in 164599 nanoseconds
Op | INFO | Finished Operation ttnn.reshape in 24209 nanoseconds
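To get a rough host-side measurement of the cached-run speed-up, one could time a batch of calls with Python's time module (a minimal sketch; host timing includes dispatch overhead and, depending on the dispatch mode, may need an explicit device synchronization for accurate numbers):
[ ]:
import time

# Rough, illustrative host-side timing of cached matmul runs.
iterations = 100
start = time.perf_counter()
for _ in range(iterations):
    output = ttnn.matmul(a, b, memory_config=ttnn.L1_MEMORY_CONFIG, core_grid=ttnn.CoreGrid(y=8, x=8))
elapsed = time.perf_counter() - start
print(f"average per call: {elapsed / iterations * 1e6:.1f} us")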
Close the device
[13]:
ttnn.close_device(device)
Metal | INFO | Closing device 0