MLP Inference
In this example we combine insights from the previous examples and use TT-NN with PyTorch to run a simple MLP inference task. It demonstrates how to use TT-NN for tensor operations and model inference.
Let's create the example file, ttnn_mlp_inference_mnist.py.
Import Libraries
This script imports the libraries needed to run MNIST digit classification with a multi-layer perceptron (MLP) on Tenstorrent hardware. The torch library provides utilities for loading and manipulating data, while torchvision and its transforms submodule download the MNIST dataset and apply tensor conversion and normalization to the image inputs. The ttnn library is the core interface for running neural network operations on Tenstorrent devices, including tensor creation, data layout transformation, and layer computation (e.g., linear, relu). The os module is used to check for a pretrained weights file on disk. Finally, loguru provides structured logging throughout the script, including loading status, prediction results, and the final accuracy report. numpy is imported as well, although the code shown in this walkthrough does not call it directly.
[ ]:
import torch
import torchvision
import torchvision.transforms as transforms
import numpy as np
import ttnn
import os
from loguru import logger
Open the Device
Open the Tenstorrent device (device ID 0) on which the program will run.
[ ]:
# Open Tenstorrent device
device = ttnn.open_device(device_id=0)
Load MNIST Test Data
Load the MNIST test set, converting each 28x28 grayscale image to a tensor and normalizing it. Then create a DataLoader that iterates through the dataset one image at a time, so that inference can be run on each image.
[ ]:
# Load MNIST data
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
testset = torchvision.datasets.MNIST(root="./data", train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=1, shuffle=False)
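To make the effect of the transform concrete, here is a minimal NumPy sketch of the arithmetic that ToTensor followed by Normalize((0.5,), (0.5,)) applies to a grayscale image: pixel values are first scaled from [0, 255] to [0, 1], then shifted and scaled to [-1, 1]. The helper name normalize_pixels is hypothetical and not part of the tutorial script, and the sketch leaves out the channel dimension that ToTensor also adds.

```python
import numpy as np

def normalize_pixels(pixels_uint8, mean=0.5, std=0.5):
    """Mimic the arithmetic of ToTensor + Normalize((mean,), (std,))."""
    # ToTensor step: uint8 values in [0, 255] become float32 values in [0.0, 1.0]
    scaled = pixels_uint8.astype(np.float32) / 255.0
    # Normalize step: (x - mean) / std, which maps [0, 1] onto [-1, 1] for 0.5/0.5
    return (scaled - mean) / std

# Hypothetical 1x3 "image": black, mid-gray, and white pixels
image = np.array([[0, 128, 255]], dtype=np.uint8)
normalized = normalize_pixels(image)
print(normalized)
```

Black pixels map to -1 and white pixels to 1, so the inputs are centered around zero before they reach the first linear layer.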
Load Pretrained MLP Weights
Load the pretrained MLP weights from a file. To generate the file, run the script train_and_export_mlp.py. If the weights file is not found, random weight values are generated instead so that the pipeline can still be tested, but expect poor prediction results.
[ ]:
if os.path.exists("mlp_mnist_weights.pt"):
    # Pretrained weights
    weights = torch.load("mlp_mnist_weights.pt")
    W1 = ttnn.from_torch(weights["W1"], dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    b1 = ttnn.from_torch(weights["b1"], dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    W2 = ttnn.from_torch(weights["W2"], dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    b2 = ttnn.from_torch(weights["b2"], dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    W3 = ttnn.from_torch(weights["W3"], dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    b3 = ttnn.from_torch(weights["b3"], dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    logger.info("Loaded pretrained weights from mlp_mnist_weights.pt")
else:
    # Random weights for MLP - will not predict correctly
    logger.warning("mlp_mnist_weights.pt not found, using random weights")
    W1 = ttnn.rand((128, 28 * 28), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    b1 = ttnn.rand((128,), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    W2 = ttnn.rand((64, 128), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    b2 = ttnn.rand((64,), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    W3 = ttnn.rand((10, 64), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    b3 = ttnn.rand((10,), dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
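The random-weight branch also documents the shapes the script expects: each weight matrix is stored as (out_features, in_features), with layer sizes 784, 128, 64, and 10. A weights file with a different layout would only fail later, inside the inference loop. The following is a minimal sketch of an early sanity check, using NumPy arrays as stand-ins for the loaded tensors. The helper check_mlp_shapes and the EXPECTED_SHAPES table are hypothetical and not part of the tutorial script; only the keys W1 through b3 and the shapes come from the code above.

```python
import numpy as np

# Shapes taken from the random-weight fallback: W is (out_features, in_features).
EXPECTED_SHAPES = {
    "W1": (128, 28 * 28), "b1": (128,),
    "W2": (64, 128), "b2": (64,),
    "W3": (10, 64), "b3": (10,),
}

def check_mlp_shapes(weights):
    """Return a list of problems; an empty list means every shape matches."""
    problems = []
    for name, expected in EXPECTED_SHAPES.items():
        if name not in weights:
            problems.append(f"missing {name}")
        elif tuple(weights[name].shape) != expected:
            actual = tuple(weights[name].shape)
            problems.append(f"{name}: expected {expected}, got {actual}")
    return problems

# Stand-in weights with the expected shapes
good = {name: np.zeros(shape, dtype=np.float32) for name, shape in EXPECTED_SHAPES.items()}
# Same dictionary, but with W2 accidentally stored as (in_features, out_features)
bad = dict(good, W2=np.zeros((128, 64), dtype=np.float32))

print(check_mlp_shapes(good))
print(check_mlp_shapes(bad))
```

Because the check only reads the .shape attribute, the same helper should also accept the dictionary returned by torch.load, assuming its values are tensors.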
Accuracy Tracking, Inference Loop, and Image Flattening
The following code snippet performs inference on the first five test samples from the MNIST dataset using a multi-layer perceptron (MLP) executed on Tenstorrent hardware via the TT-NN API. It initializes counters for the number of correct predictions and the number of samples processed. For each sample, the input image is flattened into a 1x784 row vector and converted from a PyTorch tensor to a TT-NN tensor with bfloat16 precision and TILE_LAYOUT for efficient execution.
The tensor is then passed sequentially through three fully connected layers: each of the first two layers applies a linear transformation followed by a ReLU activation, while the final layer produces raw logits for the 10 output classes (digits 0–9). For each layer, the weights are transposed and the biases are reshaped into a single row to match TT-NN’s expected input dimensions. The final output is converted back to a PyTorch tensor, and the class with the highest logit is selected as the predicted label. The prediction is compared with the true label to update the counters, and the result is logged. Once all five samples are processed, the script logs the overall prediction accuracy.
[ ]:
correct = 0
total = 0
for i, (image, label) in enumerate(testloader):
    if i >= 5:
        break
    # Flatten the 1x1x28x28 image into a 1x784 row vector
    image = image.view(1, -1).to(torch.float32)
    # Convert to TT-NN Tensor
    # Convert the PyTorch tensor to TT-NN format with bfloat16 data type and
    # TILE_LAYOUT. This is necessary for efficient computation on the
    # Tenstorrent device.
    image_tt = ttnn.from_torch(image, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
    # Layer 1
    # Transposed weights are used to match TT-NN's expected shape. The bias is
    # reshaped to 1x128 for broadcasting before computing output 1.
    W1_final = ttnn.transpose(W1, -2, -1)
    b1_final = ttnn.reshape(b1, [1, -1])
    out1 = ttnn.linear(image_tt, W1_final, bias=b1_final)
    out1 = ttnn.relu(out1)
    # Layer 2
    # Same pattern as Layer 1, but with different weights and biases.
    W2_final = ttnn.transpose(W2, -2, -1)
    b2_final = ttnn.reshape(b2, [1, -1])
    out2 = ttnn.linear(out1, W2_final, bias=b2_final)
    out2 = ttnn.relu(out2)
    # Layer 3
    # Final layer with 10 outputs (for digits 0-9). No ReLU activation here, as
    # this is the output layer.
    W3_final = ttnn.transpose(W3, -2, -1)
    b3_final = ttnn.reshape(b3, [1, -1])
    out3 = ttnn.linear(out2, W3_final, bias=b3_final)
    # Convert result back to torch
    prediction = ttnn.to_torch(out3)
    predicted_label = torch.argmax(prediction, dim=1).item()
    correct += predicted_label == label.item()
    total += 1
    logger.info(f"Sample {i+1}: Predicted={predicted_label}, Actual={label.item()}")
logger.info(f"\nTT-NN MLP Inference Accuracy: {correct}/{total} = {100.0 * correct / total:.2f}%")
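As a cross-check for the loop above, the same three-layer computation can be written on the host. The following is a minimal NumPy sketch under stated assumptions, not the TT-NN implementation: it runs in float32 rather than bfloat16, so its logits would differ slightly from device results, and it uses random placeholder weights and a random input instead of the pretrained file and a real MNIST image. The function name mlp_forward is hypothetical.

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2, W3, b3):
    """Host-side reference. x is (batch, 784); each W is (out_features, in_features)."""
    h1 = np.maximum(x @ W1.T + b1, 0.0)   # Layer 1: linear + ReLU
    h2 = np.maximum(h1 @ W2.T + b2, 0.0)  # Layer 2: linear + ReLU
    return h2 @ W3.T + b3                 # Layer 3: raw logits, no activation

# Placeholder parameters with the same shapes as the tutorial's weights
rng = np.random.default_rng(0)
sizes = [28 * 28, 128, 64, 10]
params = []
for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
    params.append(rng.random((fan_out, fan_in), dtype=np.float32))  # weight matrix
    params.append(rng.random((fan_out,), dtype=np.float32))         # bias vector

# Placeholder input standing in for one flattened 28x28 image
image = rng.random((1, 28 * 28), dtype=np.float32)
logits = mlp_forward(image, *params)
predicted_label = int(np.argmax(logits, axis=1)[0])
print(logits.shape, predicted_label)
```

The transpose plays the same role here as ttnn.transpose in the loop: with weights stored as (out_features, in_features), multiplying a (1, 784) input by the transposed (784, 128) matrix yields a (1, 128) activation, and the bias broadcasts across the batch dimension.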
Close the Device
[ ]:
ttnn.close_device(device)
Full Example and Output
Let's put everything together in a complete example that can be run directly.
Run the following script to generate output:
$ python3 $TT_METAL_HOME/ttnn/tutorials/basic_python/ttnn_mlp_inference_mnist.py
2025-07-07 13:03:41.990 | info | SiliconDriver | Opened PCI device 7; KMD version: 1.34.0; API: 1; IOMMU: disabled (pci_device.cpp:198)
2025-07-07 13:03:41.992 | info | SiliconDriver | Opened PCI device 7; KMD version: 1.34.0; API: 1; IOMMU: disabled (pci_device.cpp:198)
2025-07-07 13:03:41.998 | info | Device | Opening user mode device driver (tt_cluster.cpp:190)
2025-07-07 13:03:41.998 | info | SiliconDriver | Opened PCI device 7; KMD version: 1.34.0; API: 1; IOMMU: disabled (pci_device.cpp:198)
2025-07-07 13:03:41.999 | info | SiliconDriver | Opened PCI device 7; KMD version: 1.34.0; API: 1; IOMMU: disabled (pci_device.cpp:198)
2025-07-07 13:03:42.006 | info | SiliconDriver | Opened PCI device 7; KMD version: 1.34.0; API: 1; IOMMU: disabled (pci_device.cpp:198)
2025-07-07 13:03:42.007 | info | SiliconDriver | Opened PCI device 7; KMD version: 1.34.0; API: 1; IOMMU: disabled (pci_device.cpp:198)
2025-07-07 13:03:42.013 | info | SiliconDriver | Harvesting mask for chip 0 is 0x100 (NOC0: 0x100, simulated harvesting mask: 0x0). (cluster.cpp:282)
2025-07-07 13:03:42.110 | info | SiliconDriver | Opened PCI device 7; KMD version: 1.34.0; API: 1; IOMMU: disabled (pci_device.cpp:198)
2025-07-07 13:03:42.172 | info | SiliconDriver | Opening local chip ids/pci ids: {0}/[7] and remote chip ids {} (cluster.cpp:147)
2025-07-07 13:03:42.182 | info | SiliconDriver | Software version 6.0.0, Ethernet FW version 6.14.0 (Device 0) (cluster.cpp:1039)
2025-07-07 13:03:42.268 | info | Metal | AI CLK for device 0 is: 1000 MHz (metal_context.cpp:128)
2025-07-07 13:03:42.886 | info | Metal | Initializing device 0. Program cache is enabled (device.cpp:428)
2025-07-07 13:03:42.888 | warning | Metal | Unable to bind worker thread to CPU Core. May see performance degradation. Error Code: 22 (hardware_command_queue.cpp:74)
2025-07-07 13:03:44.852 | INFO | __main__:main:32 - Loaded pretrained weights from mlp_mnist_weights.pt
2025-07-07 13:03:48.677 | INFO | __main__:main:87 - Sample 1: Predicted=7, Actual=7
2025-07-07 13:03:48.682 | INFO | __main__:main:87 - Sample 2: Predicted=2, Actual=2
2025-07-07 13:03:48.686 | INFO | __main__:main:87 - Sample 3: Predicted=1, Actual=1
2025-07-07 13:03:48.690 | INFO | __main__:main:87 - Sample 4: Predicted=0, Actual=0
2025-07-07 13:03:48.695 | INFO | __main__:main:87 - Sample 5: Predicted=4, Actual=4
2025-07-07 13:03:48.695 | INFO | __main__:main:89 -
TT-NN MLP Inference Accuracy: 5/5 = 100.00%
2025-07-07 13:03:48.695 | info | Metal | Closing mesh device 1 (mesh_device.cpp:488)
2025-07-07 13:03:48.696 | info | Metal | Closing mesh device 0 (mesh_device.cpp:488)
2025-07-07 13:03:48.696 | info | Metal | Closing device 0 (device.cpp:468)
2025-07-07 13:03:48.696 | info | Metal | Disabling and clearing program cache on device 0 (device.cpp:783)
2025-07-07 13:03:48.697 | info | Metal | Closing mesh device 1 (mesh_device.cpp:488)