Troubleshooting
Common failures, ordered by how often they bite. Each entry has the
symptom you see in kubectl, the root cause, and the fix.
ImagePullBackOff on the driver/flasher pods
NAME READY STATUS RESTARTS AGE
ttdrv-default-... 0/1 ImagePullBackOff 0 5m
$ kubectl -n tt-k8s-driver-manager-system describe pod ttdrv-...
... Failed to pull image ... HTTP 4xx / network unreachable ...
Image registries are public, so this is almost always one of:
- Egress blocked to ghcr.io or pkg-containers.githubusercontent.com. Check your cluster’s proxy / firewall allowlist.
- Wrong image tag in the policy or chart values. Run docker pull <image>:<tag> from a workstation to confirm the tag exists.
- Pull secret left over from a previous private-registry setup that no longer authenticates. Drop the secret from the ServiceAccount:
  kubectl -n tt-k8s-driver-manager-system patch sa tt-k8s-driver-manager-installer --type=json -p='[{"op":"remove","path":"/imagePullSecrets"}]'
“Pod is in use; cannot reinstall” — refcnt > 0
$ kubectl -n tt-k8s-driver-manager-system logs ttdrv-default-...
ERROR: tt-kmd 2.7.0 loaded with refcnt > 0; cannot reinstall 2.8.0
Holders: 12345 23456
Drain workloads holding /dev/tenstorrent and let the next reconcile retry.
A workload has /dev/tenstorrent/* open. The builder refuses to
rmmod (the kernel would refuse anyway with Module is in use).
Find the workloads:
$ kubectl get pods --all-namespaces -o json | jq -r '
.items[] | select(.spec.volumes[]? | .hostPath.path? == "/dev/tenstorrent")
| "\(.metadata.namespace)/\(.metadata.name) on \(.spec.nodeName)"'
Or take the PIDs from the log and look them up on the node: open a
kubectl debug node/X session and run chroot /host ps -fp <pid>.
Drain those pods (kubectl delete pod for bare pods,
kubectl scale deploy --replicas=0 for Deployments). The next
reconcile will see refcnt=0 and proceed.
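To confirm on the node that the drain actually released the module, the reference count can be read straight from sysfs. This is a minimal sketch, assuming the module is named tenstorrent; the function names and the optional sysfs-root argument are illustrative and not part of the operator.

```shell
# Print the reference count of a kernel module, or "not-loaded" if absent.
# $1 = module name (assumed: tenstorrent)
# $2 = sysfs root (default /sys; overridable only so the function can be
#      exercised against a fake tree)
module_refcnt() {
  local module="${1:-tenstorrent}" sysfs="${2:-/sys}"
  local f="$sysfs/module/$module/refcnt"
  if [ -r "$f" ]; then
    cat "$f"
  else
    echo "not-loaded"
  fi
}

# Exit 0 only when rmmod could succeed: refcnt is 0, or the module is
# not loaded at all.
module_is_idle() {
  local n
  n="$(module_refcnt "$@")"
  [ "$n" = "0" ] || [ "$n" = "not-loaded" ]
}
```

Run it on the host (or after chroot /host in a node debug session), for example: module_is_idle tenstorrent || echo "still held, refcnt=$(module_refcnt tenstorrent)".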
/bin/sh: 1: gcc-12: not found
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc-13 ...
You are using: ...
/bin/sh: 1: gcc-12: not found
make[2]: *** [/tmp/.../module.o] Error 127
The kernel headers’ Makefile hardcodes a specific gcc-<version>
binary name (matching the gcc the kernel was compiled with). The
builder image ships gcc-12 by default — fine for Ubuntu 22.04 jammy +
HWE 6.x kernels. If your host’s kernel was compiled with a different
gcc, the build dies.
Check the host’s compiler:
$ cat /proc/version
Linux version 6.x.0-... (Ubuntu gcc-13 ...)
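The comparison can be scripted when checking many nodes. The sketch below assumes the Ubuntu-style gcc-<N> token shown in the /proc/version output above and the builder default of gcc-12; both function names are hypothetical.

```shell
# Extract the major version of the gcc that built the kernel from a
# /proc/version-style line on stdin. Assumes a "gcc-<N>" token is present
# (as on Ubuntu kernels); prints nothing if it is not.
kernel_gcc_major() {
  grep -oE 'gcc-[0-9]+' | head -n 1 | cut -d- -f2
}

# Compare the kernel's gcc major (from stdin) with the builder image's.
# $1 = builder gcc major (default 12, the builder default per this guide).
# Returns 0 on match, 1 on mismatch or when no gcc token was found.
builder_gcc_matches() {
  local builder_major="${1:-12}" kernel_major
  kernel_major="$(kernel_gcc_major)"
  [ -n "$kernel_major" ] && [ "$kernel_major" = "$builder_major" ]
}
```

Typical use on the host: builder_gcc_matches 12 < /proc/version || echo "kernel built with gcc-$(kernel_gcc_major < /proc/version)".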
Two fixes:
- Use a builder image built with the matching gcc. Bump gcc-12 → gcc-N in images/driver-build/Dockerfile, rebuild, push, point driver.image at the new tag.
- Use a host with a different kernel. Match the kernel against the builder image: apt install linux-image-generic-hwe-22.04 pulls a jammy/gcc-12 kernel.
Long-term, the builder should detect the kernel’s compiler at runtime
and apt install the right gcc — open TODO.
host has no kernel build tree at /lib/modules/<kver>/build
ERROR: host has no kernel build tree at /lib/modules/6.8.0-111-generic/build
Install linux-headers-6.8.0-111-generic on the host.
The builder needs the host’s kernel headers to compile against. The host doesn’t have them installed (or has the wrong ones — e.g. headers for an older kernel after a kernel upgrade + reboot).
sudo apt install linux-headers-$(uname -r)
# Or, on HWE:
sudo apt install linux-headers-generic-hwe-22.04
After that, restart the failing pod:
kubectl -n tt-k8s-driver-manager-system delete pod -l app.kubernetes.io/component=driver \
--field-selector spec.nodeName=<that-node>
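Before restarting the pod, it can help to confirm that the headers for the running kernel really landed. A minimal sketch, with a hypothetical function name; it treats the presence of build/Makefile as "build tree usable", which is an assumption rather than the builder's actual check.

```shell
# Return 0 if a kernel build tree exists for the given kernel release.
# $1 = kernel release (default: the running kernel)
# $2 = filesystem root (default /; overridable only for testing against a
#      fake tree). On a real node run this after chroot /host, because
#      /lib/modules/<kver>/build is usually an absolute symlink.
has_kernel_build_tree() {
  local kver="${1:-$(uname -r)}" root="${2:-/}"
  [ -f "${root%/}/lib/modules/$kver/build/Makefile" ]
}
```

For example: has_kernel_build_tree || echo "headers for $(uname -r) are missing".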
NFD label missing — node not picked up
$ kubectl get nodes -L feature.node.kubernetes.io/pci-1200_1e52.present
NAME STATUS PRESENT
node-1 Ready
(empty, but the node has a Tenstorrent card)
Causes:
- NFD worker not running on that node. Check kubectl get pod -l app.kubernetes.io/name=node-feature-discovery -A. The worker pod should be Running on every Tenstorrent node.
- NFD’s PCI source disabled. Check the NFD chart’s worker.config.core.featureSources; it must include pci. The tt-operator umbrella sets this to ["pci"] (only); a separate NFD install might have it disabled.
- NFD’s deviceClassWhitelist excludes class 12. Class 12 (which covers 1200) is in the default whitelist. If you narrowed it, add class 12 back.
If you want to bypass NFD entirely (dev clusters, kind), set on the controller deployment:
helm upgrade tt-k8s-driver-manager ... --set controller.requireTenstorrentLabel=false
The DaemonSet’s nodeAffinity will still require the label though, so
also kubectl label node <name> feature.node.kubernetes.io/pci-1200_1e52.present=true
on dev nodes. There’s a hack/dev/label-fake-tt-nodes.yaml for kind.
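To check what NFD should be publishing for a given card, the label can be derived from lspci -n output on the host. This sketch assumes NFD's default PCI label fields (class and vendor); the function name is illustrative, and the device ID in the usage example is only a sample.

```shell
# Turn lines of `lspci -n` output into the NFD label that would be
# published with the default label fields (class, vendor).
# Input format per line: "<slot> <class>: <vendor>:<device> [...]"
nfd_pci_label() {
  awk '{
    class = $2; sub(/:$/, "", class)   # "1200:" -> "1200"
    split($3, id, ":")                 # "1e52:401e" -> id[1] = "1e52"
    printf "feature.node.kubernetes.io/pci-%s_%s.present\n", class, id[1]
  }'
}
```

On a node with a card, lspci -n -d 1e52: | nfd_pci_label should print the label used above. For instance, the sample line "01:00.0 1200: 1e52:401e" maps to feature.node.kubernetes.io/pci-1200_1e52.present.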
Module loaded but pod NotReady
$ kubectl -n tt-k8s-driver-manager-system get pod -l app.kubernetes.io/component=driver
ttdrv-default-... 0/1 Running 0 30s
The pod’s readiness probe checks /sys/module/tenstorrent/version == $TT_KMD_VERSION. If the loaded version doesn’t match what the CR wants:
- The pod is honest: it’s NotReady because the kernel isn’t at the declared state.
- Most common cause: rmmod failed earlier (refcnt > 0), so the new .ko was built and insmod was called, but insmod failed silently because a module with the same name is already loaded.
Check the pod’s logs for the actual error. Then either drain workloads (refcnt → 0) or reboot the host.
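The same comparison the probe makes can be reproduced by hand on the node. A rough sketch under the assumption that the probe is a plain string comparison against /sys/module/tenstorrent/version; the function name, messages, and exit codes are made up for illustration.

```shell
# Compare the loaded tt-kmd version with the wanted one.
# $1 = wanted version (what the CR declares)
# $2 = sysfs root (default /sys; overridable only for testing)
# Exit codes: 0 = match, 1 = mismatch, 2 = module not loaded.
kmd_version_status() {
  local want="$1" sysfs="${2:-/sys}" have
  if ! have="$(cat "$sysfs/module/tenstorrent/version" 2>/dev/null)"; then
    echo "module not loaded (wanted $want)"
    return 2
  fi
  if [ "$have" = "$want" ]; then
    echo "ok: loaded $have"
  else
    echo "mismatch: loaded $have, wanted $want"
    return 1
  fi
}
```

A mismatch here, combined with a non-zero refcnt, points at the stale-module case described above.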
Host has tt-kmd from DKMS/apt; operator ignored it
$ kubectl get node node-1 -L driver.tenstorrent.com/install-mode
NAME INSTALL-MODE
node-1 host
Expected, not a bug. The builder pod detected /var/lib/dkms/tenstorrent
or /usr/src/tenstorrent-<v>/dkms.conf and stood down. The host’s
tt-kmd stays, the operator doesn’t rmmod or rebuild.
If you want the operator to take over, follow Migrating from DKMS — it has the per-node vacate script, the cluster-side cordon/drain coordination, and the watch-outs (both DKMS signal dirs, refcnt > 0, proxied builder).
If you want to keep the host install and just stop driver-manager from touching the node entirely:
kubectl label node <name> driver.tenstorrent.com/skip=true
tt-smi can’t execute on host
$ tt-smi -s
tt-smi: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found
/usr/local/bin/tt-smi is the self-contained binary from tt-smi’s
GitHub releases, which ships one flavor per Ubuntu release
(tt-smi-<v>-ubuntu-22.04, -ubuntu-24.04, …). The builder image
bakes in the flavor matching its ARG UBUNTU_VERSION; if that’s newer
than the host OS, the binary needs glibc symbols the host doesn’t have.
Fix: set ARG UBUNTU_VERSION in images/driver-build/Dockerfile to
the hosts’ Ubuntu release and push. Mixed-OS fleets need one builder
image (and so one TenstorrentDriverPolicy with a matching
nodeAffinity) per Ubuntu release.
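To see how far apart the binary and the host are, compare the glibc version named in the error with the one the host provides. A small sketch with hypothetical helper names; it relies on GNU sort -V and on getconf GNU_LIBC_VERSION, both of which are available on glibc-based Ubuntu hosts.

```shell
# Return 0 if version $1 is >= version $2 (version-aware comparison).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Read loader error text on stdin and print the highest GLIBC_x.y it asks for.
required_glibc() {
  grep -oE 'GLIBC_[0-9]+(\.[0-9]+)*' | cut -d_ -f2 | sort -V | tail -n 1
}

# Print the host's glibc version, e.g. "2.35" on Ubuntu 22.04.
host_glibc_version() {
  getconf GNU_LIBC_VERSION | awk '{print $2}'
}
```

For example: need="$(tt-smi -s 2>&1 | required_glibc)"; version_ge "$(host_glibc_version)" "$need" || echo "host glibc older than $need".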
Builder pod can’t clone tt-kmd — proxy / DNS
$ kubectl -n tt-k8s-driver-manager-system logs ttdrv-default-...
fatal: unable to access 'https://github.com/tenstorrent/tt-kmd.git/':
Could not resolve host: github.com
Builder pod has no network path to GitHub for the cache-miss clone.
Common when pod egress goes through a proxy (CI behind squid, isolated
clusters) and HTTPS_PROXY isn’t set on the builder.
The controller propagates its own HTTPS_PROXY / HTTP_PROXY /
NO_PROXY env to spawned builder pods, so the fix is usually on the
controller side. Check that the controller pod itself has the proxy
env set:
kubectl -n tt-k8s-driver-manager-system get deploy \
tt-k8s-driver-manager-controller -o jsonpath='{.spec.template.spec.containers[0].env}' \
| jq '.[] | select(.name | test("PROXY"))'
If empty, set via Helm:
# Commas inside a --set value must be escaped as \, or Helm splits on them.
helm upgrade tt-k8s-driver-manager ... \
--set 'controller.extraEnv[0].name=HTTPS_PROXY' \
--set 'controller.extraEnv[0].value=http://proxy.internal:3128' \
--set 'controller.extraEnv[1].name=NO_PROXY' \
--set 'controller.extraEnv[1].value=10.0.0.0/8\,.svc\,.svc.cluster.local'
The controller restart re-templates the builder DaemonSet with the new env; existing pods need a delete to re-roll.
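Because Helm treats commas in --set values as separators, a NO_PROXY list is often easier to pass through a values file. The sketch below generates one; it assumes controller.extraEnv is a list of name/value entries, as the --set paths above suggest, and the function name is illustrative.

```shell
# Print a Helm values fragment that sets proxy env on the controller.
# $1 = proxy URL, $2 = comma-separated NO_PROXY list (no escaping needed).
write_proxy_values() {
  local proxy="$1" no_proxy="$2"
  cat <<EOF
controller:
  extraEnv:
    - name: HTTPS_PROXY
      value: "$proxy"
    - name: NO_PROXY
      value: "$no_proxy"
EOF
}
```

Usage: write_proxy_values http://proxy.internal:3128 '10.0.0.0/8,.svc,.svc.cluster.local' > proxy-values.yaml, then helm upgrade tt-k8s-driver-manager ... -f proxy-values.yaml.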
Policy never matches — MessageExternalCordon on the CR
$ kubectl describe ttfp <name>
...
Status:
Per Node:
Name: node-1
State: Pending
Message: node is cordoned but not by this operator
The node has spec.unschedulable: true but no
firmware.tenstorrent.com/cordoned-by=<policy-name> (or the driver
equivalent driver.tenstorrent.com/cordoned-by) annotation. The
controller refuses to flash a node it didn’t cordon itself — protects
SREs who’ve taken nodes out of rotation for unrelated reasons.
Either uncordon and let the controller cordon it itself:
kubectl uncordon <node>
…or claim the existing cordon by stamping the annotation:
kubectl annotate node <node> \
firmware.tenstorrent.com/cordoned-by=<policy-name>
(Same pattern for driver policies, with driver.tenstorrent.com/cordoned-by.)
Alternative: set upgradePolicy.drain.enable: false on the CR to skip
the cordon gate entirely. The flash Job will then land via its universal
toleration regardless of cordon state. This loses the device-pod-eviction
safety.
Flash didn’t re-run after editing spec.flasher.image
You bumped TenstorrentFirmwarePolicy.spec.flasher.image (or
forceWrite, imagePullPolicy, etc.) and the controller did nothing.
Known limitation:
the per-node flash Job name is hashed on (CR name, node name, kmd version). The flasher fields are NOT in the hash. If a Job already
exists at that name with Completed status, the controller treats
the node as Done.
Workarounds until #42 ships a fix:
- Bump spec.version (forces a new Job name).
- Delete the per-node Job: kubectl -n tt-k8s-driver-manager-system delete job -l driver.tenstorrent.com/cr=<name>,driver.tenstorrent.com/node=<node>
- Wait out the 24h Job TTL. The next reconcile will spawn a fresh Job that picks up the new flasher fields.
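Why editing the flasher fields changes nothing is easiest to see with a toy version of the naming scheme. The function below is purely hypothetical: the real prefix, hash function, and truncation used by the controller are not documented here. It only illustrates that a name derived from (CR name, node name, version) cannot change when some other field does.

```shell
# HYPOTHETICAL naming scheme, for illustration only.
# The name depends on exactly three inputs; flasher.image, forceWrite and
# imagePullPolicy are deliberately not parameters.
flash_job_name() {
  local cr="$1" node="$2" version="$3" digest
  digest="$(printf '%s/%s/%s' "$cr" "$node" "$version" | sha256sum | cut -c1-10)"
  printf 'flash-%s\n' "$digest"
}
```

Calling flash_job_name default node-1 2.8.0 twice yields the same name no matter which flasher image is configured, so an existing Completed Job at that name still counts as done. Changing the third argument yields a new name, which is why bumping spec.version works.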
Multiple policies match the same node — MessageNodeConflict
$ kubectl describe ttfp <name>
...
Message: node is also matched by another firmware policy
Two TenstorrentFirmwarePolicy (or two TenstorrentDriverPolicy) CRs
have overlapping nodeSelectors. The controller refuses to flash so
that the policies don’t race each other.
Find the offenders:
kubectl get ttfp -o json | jq -r '
.items[] | {name: .metadata.name, selector: .spec.nodeSelector}'
Narrow one selector so each node matches exactly one CR. Typical
mistake: a “wildcard” CR (empty nodeSelector) sitting next to a
team-scoped CR. Add a matchExpressions NotIn on the wildcard or
delete it.
CR keeps re-flashing despite node being at the right version
The firmware controller’s “this node is done” signal is a Complete
Job in its own namespace. If you (a) moved the controller to a new
namespace, or (b) deleted all old flash Jobs, the controller has no
record that the node was flashed and re-runs.
Pre-flash readback (inside the flasher Job) will report the existing
version matches the desired, and tt-flash with --force=false will
no-op. So it’s harmless, just slow. The node’s current-version
annotation is the cheaper source of truth — bumping the controller to
read it as a tiebreaker is open work.
Controller is in a tight reconcile loop
kubectl logs -n tt-k8s-driver-manager-system deploy/tt-k8s-driver-manager-controller
shows “updated tt-kmd DaemonSet” multiple times per second.
Old symptom (fixed): the diff predicate compared image but missed
other template fields. If you see this on a current version, file a
bug — the driver.tenstorrent.com/template-hash annotation on the DS
should be making this impossible.
Fully clean a host
When a node has been managed by both this operator and a host-side DKMS installer (or earlier non-container installer pods) and you want to start fresh:
# As root on the node:
rmmod tenstorrent # may fail if refcnt>0
rm -rf /usr/src/tenstorrent-*
rm -rf /var/lib/dkms/tenstorrent
rm -f /lib/modules/*/updates/dkms/tenstorrent.ko*
rm -f /etc/modules-load.d/tenstorrent.conf
rm -rf /opt/tt
rm -f /usr/local/bin/tt-smi
rm -rf /var/cache/tt-kmd/
for kdir in /lib/modules/*/; do depmod -a "$(basename "$kdir")"; done
After reboot, the host should have no trace of tt-kmd, and the builder pod will fall through to its build path on first reconcile.
This can also be scripted via kubectl debug node/<n> --profile=sysadmin -- chroot /host bash -c '<script>' so you don’t need SSH access.
When all else fails
kubectl -n tt-k8s-driver-manager-system logs deploy/tt-k8s-driver-manager-controller --tail=100
kubectl -n tt-k8s-driver-manager-system get events --sort-by=.lastTimestamp | tail -20
kubectl -n tt-k8s-driver-manager-system describe ttdp <name>
kubectl describe node <name>
Most issues are visible in one of those four. File a bug with:
- The CR’s full YAML (kubectl get ttdp <name> -o yaml)
- Affected node’s labels (kubectl get node <name> -o yaml | grep -A20 labels)
- The failing pod’s logs (current and --previous)
- The events from the operator namespace.
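Gathering those four items can be wrapped in a small helper. This is a sketch, not a shipped tool: the function name, output file names, and the KUBECTL override are assumptions, and it uses kubectl get node --show-labels in place of the grep shown above.

```shell
# Collect bug-report artifacts into a directory.
# $1 = CR name, $2 = node name, $3 = failing pod name,
# $4 = output directory (default: tt-bug-bundle)
# KUBECTL may point at another binary or wrapper (default: kubectl).
collect_bug_bundle() {
  local cr="$1" node="$2" pod="$3" out="${4:-tt-bug-bundle}"
  local ns="tt-k8s-driver-manager-system" k="${KUBECTL:-kubectl}"
  mkdir -p "$out" || return 1
  "$k" -n "$ns" get ttdp "$cr" -o yaml > "$out/cr.yaml"
  "$k" get node "$node" --show-labels > "$out/node-labels.txt"
  "$k" -n "$ns" logs "$pod" > "$out/pod.log"
  # --previous fails if the container never restarted; keep going anyway.
  "$k" -n "$ns" logs "$pod" --previous > "$out/pod-previous.log" 2>&1 || true
  "$k" -n "$ns" get events --sort-by=.lastTimestamp > "$out/events.txt"
}
```

Example: collect_bug_bundle <name> <node> <pod> ./bundle, then attach the directory to the bug.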