Deploy LLMs

This page demonstrates how to deploy LLMs using the tt-inference-server project. We use vLLM to serve LLMs for production applications. It is also a convenient entry point into Tenstorrent’s software ecosystem.

1. Set up system dependencies

⚠️ NOTE: read the following instructions before continuing:

  • this page assumes you have already used tt-installer to install the system dependencies as shown in the starting guide

  • if you are using any Wormhole-based product, you will need firmware version <= v18.5.0. Execute the following command to install v18.5.0:

(set -e; error_handler() { echo -e "\033[0;31m!!! ERROR: Failed to flash firmware version v18.5.0\033[0m"; }; trap error_handler ERR; TMP_DIR=$(mktemp -d); cleanup() { echo "---"; echo "Cleaning up..."; if type deactivate &>/dev/null; then deactivate; fi; echo "Removing temporary directory: $TMP_DIR"; rm -rf "$TMP_DIR"; cd; echo "Cleanup complete."; }; trap cleanup EXIT; cd "$TMP_DIR"; echo "Working in temporary directory: $TMP_DIR"; echo "---"; echo "Downloading firmware bundle..."; wget -q --show-progress https://github.com/tenstorrent/tt-firmware/releases/download/v18.5.0/fw_pack-18.5.0.fwbundle; echo "Download complete."; echo "---"; echo "Creating Python virtual environment..."; python3 -m venv tt-flash-venv; source tt-flash-venv/bin/activate; echo "Virtual environment activated."; echo "---"; echo "Installing tt-flash from git..."; pip install --quiet git+https://github.com/tenstorrent/tt-flash.git; echo "tt-flash installed."; echo "---"; echo "Running flash command. This may take a moment..."; tt-flash --fw-tar fw_pack-18.5.0.fwbundle --force; echo "---"; echo "Script finished successfully.";)
  • if you are using the following Wormhole-based products, you will need to use tt-topology to configure a system-level mesh topology between your Wormhole devices:

    • TT-QuietBox (Wormhole)

    • TT-LoudBox

  • execute the following command to install tt-topology and configure the system-level mesh topology:

TMP_DIR=$(mktemp -d); (trap 'echo "---"; echo "Cleaning up..."; if type deactivate &>/dev/null; then deactivate; fi; echo "Removing temporary directory: $TMP_DIR"; rm -rf "$TMP_DIR"; cd; echo "Cleanup complete."' EXIT; trap 'echo -e "\033[0;31m!!! ERROR: Failed to configure mesh topology\033[0m"' ERR; set -e; cd "$TMP_DIR"; echo "Working in temporary directory: $TMP_DIR"; echo "---"; echo "Creating Python virtual environment..."; python3 -m venv tt-topology-venv; source tt-topology-venv/bin/activate; echo "Virtual environment activated."; echo "---"; echo "Installing tt-topology from git..."; pip install --quiet git+https://github.com/tenstorrent/tt-topology.git; echo "tt-topology installed."; echo "---"; echo "Running tt-topology command. This may take a moment..."; tt-topology -l mesh; echo "---"; echo "Script finished successfully.";)
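Once the steps above are complete, you can optionally confirm that your devices are detected before moving on. A minimal check, assuming tt-smi was installed by tt-installer:

# Launch the tt-smi utility to confirm your Tenstorrent devices are detected
# and to check the firmware version they report
tt-smi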

2. Deploy a vLLM server using tt-inference-server

If this is your first experience with Tenstorrent, these are the recommended models to try with each product:

  • TT-QuietBox (Wormhole), TT-QuietBox (Blackhole), TT-LoudBox

    • meta-llama/Llama-3.3-70B-Instruct

  • n150s, n150d, n300s, n300d, p100a, p150a, p150b

    • meta-llama/Llama-3.1-8B-Instruct

For a full list of the currently available and tested models, please visit the tt-inference-server GitHub page.

a. Request model permissions and create Hugging Face Access Token

Downloading the recommended models requires access via Hugging Face. You will need to create a Hugging Face account to continue. Please visit the model’s Hugging Face page to request access.

NOTE: It can take up to a few days to be granted access. This process is not controlled by Tenstorrent.

Once you have been granted access, you will need to create a Hugging Face access token. To generate one, follow these instructions: https://huggingface.co/docs/hub/en/security-tokens. This token is required to download the model’s weights from Hugging Face.

Now, using the system you will be deploying the vLLM server on, export the HF_TOKEN environment variable and set the value to your Hugging Face Access Token:

export HF_TOKEN=<your-hugging-face-access-token>
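To quickly confirm the token itself is valid, you can optionally query the Hugging Face whoami endpoint. This sketch only verifies the token; model access is checked in the next section:

# Optional sanity check: a valid token returns your Hugging Face account details,
# an invalid or expired token returns an authentication error
curl -s -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2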

b. Specify target hardware and deployment model

Execute this script to specify which hardware product you are using. The script will set the correct environment variables for your hardware product and automatically choose the recommended model to deploy as per the previous section:

select_device_and_model(){ echo -e "\nSelect a Tenstorrent system from the list below:"; PS3=$'\n#? '; options=("TT-QuietBox (Wormhole)" "TT-QuietBox (Blackhole)" "TT-LoudBox" "n150s" "n150d" "n300s" "n300d" "p100a" "p150a" "p150b" "Quit"); select opt in "${options[@]}"; do IS_BLACKHOLE=""; case "$opt" in "TT-QuietBox (Wormhole)") DEVICE="t3k"; MODEL="Llama-3.3-70B-Instruct";; "TT-QuietBox (Blackhole)") DEVICE="p150x4"; MODEL="Llama-3.3-70B-Instruct"; IS_BLACKHOLE="--dev-mode";; "TT-LoudBox") DEVICE="t3k"; MODEL="Llama-3.3-70B-Instruct";; "n150s"|"n150d") DEVICE="n150"; MODEL="Llama-3.1-8B-Instruct";; "n300s"|"n300d") DEVICE="n300"; MODEL="Llama-3.1-8B-Instruct";; "p100a") DEVICE="p100"; MODEL="Llama-3.1-8B-Instruct"; IS_BLACKHOLE="--dev-mode";; "p150a"|"p150b") DEVICE="p150"; MODEL="Llama-3.1-8B-Instruct"; IS_BLACKHOLE="--dev-mode";; "Quit") echo "❌ Exiting without setting any environment variables."; return;; *) echo "❌ Invalid option. Try again."; continue;; esac; export DEVICE MODEL IS_BLACKHOLE; echo -e "\n✅ DEVICE set to '$DEVICE'"; echo "✅ MODEL set to '$MODEL'"; [ -n "$IS_BLACKHOLE" ] && echo "✅ IS_BLACKHOLE set to '$IS_BLACKHOLE'"; break; done; }; select_device_and_model

Execute this script to confirm you can access the recommended model’s weights:

check_hf_access() { [ -z "$MODEL" ] && { printf "✖ Error: Please provide a Hugging Face repository ID.\n"; return 1; }; ! command -v curl &>/dev/null && { printf "✖ Error: curl is not installed.\n"; return 1; }; local REPO_ID="meta-llama/$MODEL"; local TOKEN=${HF_TOKEN:-$(cat "$HOME/.cache/huggingface/token" 2>/dev/null)}; [ -z "$TOKEN" ] && printf "ℹ️ Info: No Hugging Face token found.\n   You can only access public repositories.\n"; local AUTH_HEADER=""; [ -n "$TOKEN" ] && AUTH_HEADER="Authorization: Bearer $TOKEN"; printf "Checking access for: %s...\n" "$REPO_ID"; local URL="https://huggingface.co/$REPO_ID/resolve/main/config.json"; local HTTP_CODE=$(curl -s -L -o /dev/null -w "%{http_code}" -H "$AUTH_HEADER" "$URL"); case $HTTP_CODE in 200) printf "✔ Access granted.\n";; 401) printf "✖ Access denied (401 Unauthorized).\n  This is a private or gated repository.\n  Ensure your token is valid and has the correct permissions.\n";; 403) printf "✖ Access forbidden (403 Forbidden).\n  The repository is gated.\n  You need to visit the repository page on Hugging Face and request access.\n";; 404) printf "✖ Repository or 'config.json' not found (404 Not Found).\n  Please check if the repository ID '$REPO_ID' is correct.\n";; *) printf "✖ Failed to check access.\n  Received HTTP status code: %s\n" "$HTTP_CODE";; esac; }; HF_HUB_DISABLE_XET=1; check_hf_access;

If the command does not print “✔ Access granted.”, make sure you have exported your Hugging Face token as described above.

c. Clone tt-inference-server

git clone https://github.com/tenstorrent/tt-inference-server.git
cd tt-inference-server

⚠️ If you are using a TT-QuietBox (Blackhole), p100a, p150a, or p150b, you must check out the bh-getting-started branch. Blackhole software optimization is still under active development and requires a development version of tt-inference-server.

git checkout bh-getting-started
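If you checked out the branch, you can confirm it before continuing:

# Print the currently checked-out branch; it should read bh-getting-started
git branch --show-current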

d. Start the vLLM server with the run.py script

⚠️ Disk Space Warning

You will need at least 360GB of free disk space in your root partition (typically where your home directory is mounted).
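To confirm you have enough free space, you can check the filesystem backing your home directory:

# Show free space on the filesystem that contains your home directory
df -h "$HOME"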

Set the JWT_SECRET environment variable. This is a regular string and is used to seed the generation of your vLLM server’s API key.

export JWT_SECRET="testing"
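The value "testing" above is fine for local experimentation. If you prefer a random secret instead, a minimal sketch, assuming openssl is available:

# Generate a random 32-byte hex string and use it as the JWT secret
export JWT_SECRET=$(openssl rand -hex 32)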

When executing the following command, you will be prompted to answer three questions:

  • Where do you want the persistent volume root to be created? Accept the default: (<path-to-tt-inference-server>/persistent_volume).

  • How do you want to provide the model’s weights? Accept the default: (Download from Hugging Face).

  • Select the Hugging Face cache location on the host. Accept the default: (<path-to-your-home-dir>/.cache/huggingface).

⚠️ If you are using a TT-QuietBox (Blackhole), p100a, p150a, or p150b, you will only be prompted with the second and third questions.

python3 run.py --model $MODEL --device $DEVICE --workflow server --docker-server $IS_BLACKHOLE

The first time you execute this command it will download the model’s weights from Hugging Face. Weight download can take up to 30 minutes depending on the speed of your internet connection. After the above command runs to completion, a Docker container will start and begin initializing the vLLM server. This initialization process will take up to 40 minutes the first time you start the vLLM server for the Llama-3.3-70B-Instruct model.
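If you want to follow the initialization progress, you can tail the Docker container’s logs. The container ID below is a placeholder; take it from the docker ps output:

# List running containers and note the ID of the vLLM server container
docker ps
# Follow that container's logs to watch weight loading and server initialization
docker logs -f <container-id>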

3. Make an example request to the vLLM server

The vLLM server should now be running on port 8000 of your machine. To check whether the server is ready to accept requests, execute this command:

check_server_health(){ code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/health); exit_code=$?; if [[ $exit_code -ne 0 ]]; then echo "❌ Error: Unable to connect to server at localhost:8000"; elif [[ $code -eq 200 ]]; then echo "✅ Server is ready (HTTP 200)"; else echo "⚠️ Server responded with status: $code"; fi; }; check_server_health

If the following message is printed, then the vLLM server is still not ready:

❌ Error: Unable to connect to server at localhost:8000

If the following message is printed, the vLLM server is ready to handle requests:

✅ Server is ready (HTTP 200)
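If the server is not ready yet, instead of re-running the check by hand you can poll the health endpoint until it returns HTTP 200. A minimal sketch:

# Poll the /health endpoint every 30 seconds until the server reports ready
until [ "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8000/health)" = "200" ]; do
  echo "Server not ready yet; retrying in 30 seconds..."
  sleep 30
done
echo "✅ Server is ready (HTTP 200)"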

Now that the vLLM server is ready, you must configure an API key to authenticate your requests to the server using the previously set JWT_SECRET.

The VLLM_API_KEY environment variable will hold the API key.

python3 -m venv request-venv
source request-venv/bin/activate
pip3 install --upgrade pip
pip install pyjwt==2.7.0
export VLLM_API_KEY=$(python3 -c 'import os; import json; import jwt; json_payload = json.loads("{\"team_id\": \"tenstorrent\", \"token_id\": \"debug-test\"}"); encoded_jwt = jwt.encode(json_payload, os.environ["JWT_SECRET"], algorithm="HS256"); print(encoded_jwt)')
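As an optional sanity check, you can decode the token with the same secret to confirm it round-trips correctly; this reuses the pyjwt package installed above:

# Decoding should print the original payload: {'team_id': 'tenstorrent', 'token_id': 'debug-test'}
python3 -c 'import os, jwt; print(jwt.decode(os.environ["VLLM_API_KEY"], os.environ["JWT_SECRET"], algorithms=["HS256"]))'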

The first request to the server is used to perform warmup, so it will be significantly slower than subsequent requests.

vLLM exposes an OpenAI-compatible HTTP API. Here is an example curl command to make the first request to the server:

curl -sS "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d "{
    \"model\": \"meta-llama/$MODEL\",
    \"prompt\": \"San Francisco is a\",
    \"max_tokens\": 50,
    \"temperature\": 0
  }" | jq

Now that the server is warmed up, make the same request again to see the server run at full speed:

curl -sS "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
    -d "{
    \"model\": \"meta-llama/$MODEL\",
    \"prompt\": \"San Francisco is a\",
    \"max_tokens\": 50,
    \"temperature\": 0
  }" | jq