Breaking & Fixing Things
The QB2 is a workstation, not a cloud VM with a reset button in a web console. Software changes are reversible. Hardware is physically robust. The worst case scenario for almost everything in this chapter is a few minutes of diagnostic work and a single command.
This philosophy matters. It means you should experiment freely. Break things. Learn the recovery pattern. The machine can take it.
tt-smi is the first diagnostic for anything chip-related. Before filing a bug or posting to Discord, run tt-smi -s and include the JSON output. It answers half the questions before they’re asked.
The Recovery Ladder
Before reaching for dramatic solutions, follow this order:
- Run the failing command again (transient errors happen)
- Check the relevant log:
journalctl,docker logs, or the service’s own log file - Restart the failing service:
systemctl restart <name>ordocker restart <container> - Reload the driver:
sudo modprobe -r tenstorrent && sudo modprobe tenstorrent - Reboot:
sudo reboot - Post to the Tenstorrent Discord with
tt-smi -soutput
The vast majority of issues resolve at step 1, 2, or 3.
Common Breakage Patterns
1. pip install in the Wrong Environment
Symptom: Something installed correctly but broke an import in tt-metal, or pip install warned about an externally-managed environment.
Cause: Installing into system Python, or into a tt-metal venv that shouldn’t be modified.
Fix:
# Identify which pip you used
which pip # or: which pip3
# If it was system pip, uninstall the conflicting package
pip uninstall <package-name>
# If you modified a tt-metal venv, recreate it:
# First, note the requirements file for that venv, then:
rm -rf ~/tt-metal/python_env
# Re-run the tt-metal environment setup script
# (check ~/tt-metal/README.md for the exact command)
Going forward, always activate a project-specific venv before installing packages. Never use pip install --break-system-packages unless you have a specific reason.
2. conda Conflict with tt-metal
Symptom: Conda activated, model fails to load, TTNN throws import errors about library version mismatches.
Cause: Conda replaces system libraries in PATH, breaking tt-metal’s pinned dependencies.
Fix:
# Deactivate conda
conda deactivate
# Remove conda from PATH for this session
unset CONDA_DEFAULT_ENV
# Long-term: add this to your .bashrc AFTER the conda init block
# to prevent auto-activation:
conda config --set auto_activate_base false
Keep tt-metal environments and conda environments in separate shell sessions. They do not coexist gracefully.
3. “No Devices Found” After a Kernel Upgrade
Symptom: tt-smi returns no devices. lsmod | grep tenstorrent shows nothing.
Cause: A kernel upgrade installed a new kernel without re-building or loading the Tenstorrent driver for it.
Fix:
# Check if the driver module exists for the current kernel
ls /lib/modules/$(uname -r)/extra/ | grep tenstorrent
# If missing, reinstall the tt-metal driver package
sudo apt install --reinstall tt-firmware # adjust package name as needed
# or re-run the tt-installer if you used that for initial setup
# Reload the driver
sudo modprobe tenstorrent
# Verify
tt-smi -s
If sudo modprobe tenstorrent fails with “module not found”, the driver isn’t built for the current kernel. You need to either roll back the kernel or rebuild the driver. Check the tt-metal GitHub for the currently supported kernel range.
4. Model Download Corrupted Mid-Way
Symptom: Model fails to load. Error messages about unexpected EOF or missing shards.
Fix:
# The Hugging Face CLI supports resumable downloads
huggingface-cli download <model-id> \
--local-dir ~/models/<model-name> \
--resume-download
# Example for Llama-3.1-8B:
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
--local-dir ~/models/Llama-3.1-8B-Instruct \
--resume-download
If the download is severely corrupted, delete the partial directory and start fresh:
rm -rf ~/models/<model-name>
huggingface-cli download <model-id> --local-dir ~/models/<model-name>
5. OOM During Inference
Symptom: Python process crashes with out-of-memory error during model load or inference. The model may be too large for the chip DRAM, or you’re only using one chip for a model that needs four.
Fix:
# Verify you're using all four chips for large models
python3 -m vllm.entrypoints.openai.api_server \
--model ~/models/Llama-3.1-70B-Instruct \
--num_gpus 4 \ # this is required for 70B
--port 8000
# For a smaller model that fits on fewer chips
python3 -m vllm.entrypoints.openai.api_server \
--model ~/models/Qwen3-0.6B \
--num_gpus 1 \
--port 8000
Also check that no other process is holding chip memory:
tt-smi -s | python3 -c "
import json, sys
d = json.load(sys.stdin)
for chip in d.get('device_info', []):
print(chip.get('board_id', '?'), '— mem used:', chip.get('dram_usage', '?'))
"
6. Docker or tt-inference-server Won’t Start
Symptom: docker ps hangs or errors; tt-inference-server container fails to launch.
tt-installer v3.2.0+ installs Docker by default. If you chose Podman instead (
--install-container-runtime=podman), substitutepodmanfordockerin the commands below.
Fix:
# Check if Docker daemon is running
systemctl status docker
# Start it if not
sudo systemctl start docker
# Check running containers
docker ps -a
# View logs from the last failed container
docker logs $(docker ps -a -q --filter "status=exited" | head -1)
# Hard-restart a specific container
docker stop <container-name> && docker start <container-name>
# If the container is corrupted, remove and repull
docker rm <container-name>
docker pull <image-name>
# Then re-run the server start command
7. tt-toplike Crashes at Startup
Symptom: tt-toplike exits immediately or produces a panic/error message.
Cause: Almost always a driver issue — tt-toplike can’t see the chips.
Fix:
# Verify chips are visible first
tt-smi -s
# If tt-smi also fails, reload the driver
sudo modprobe -r tenstorrent
sudo modprobe tenstorrent
tt-smi -s # should now show four devices
# Then retry
tt-toplike --mode normal
If tt-smi -s works but tt-toplike still fails, reinstall it from GitHub releases or via cargo:
# tt-toplike is not in the Tenstorrent apt PPA — reinstall from:
# https://github.com/tenstorrent/tt-toplike/releases
sudo dpkg -i tt-toplike_*.deb
# Or: cargo install tt-toplike --force
The tenstorrent kernel module is a loadable driver. If it was loaded for kernel 6.x.y and you’re now on 6.x.z, it may need to be rebuilt or reinstalled. dmesg | grep tenstorrent is your friend here — it shows exactly why the module failed to load.