Troubleshooting

Common failures, ordered by how often they bite. Each entry has the symptom you see in kubectl, the root cause, and the fix.

ImagePullBackOff on the driver/flasher pods

NAME                  READY   STATUS             RESTARTS   AGE
ttdrv-default-...     0/1     ImagePullBackOff   0          5m
$ kubectl -n tt-k8s-driver-manager-system describe pod ttdrv-...
... Failed to pull image ... HTTP 4xx / network unreachable ...

Image registries are public, so this is almost always one of:

  • Egress blocked to ghcr.io or pkg-containers.githubusercontent.com. Check your cluster’s proxy / firewall allowlist.

  • Wrong image tag in the policy or chart values — docker pull <image>:<tag> from a workstation to confirm the tag exists.

  • Pull secret left over from a previous private-registry setup that no longer authenticates. Drop the secret from the ServiceAccount: kubectl -n tt-k8s-driver-manager-system patch sa tt-k8s-driver-manager-installer --type=json -p='[{"op":"remove","path":"/imagePullSecrets"}]'.

“Pod is in use; cannot reinstall” — refcnt > 0

$ kubectl -n tt-k8s-driver-manager-system logs ttdrv-default-...
ERROR: tt-kmd 2.7.0 loaded with refcnt > 0; cannot reinstall 2.8.0
Holders: 12345 23456
Drain workloads holding /dev/tenstorrent and let the next reconcile retry.

A workload has /dev/tenstorrent/* open. The builder refuses to rmmod (the kernel would refuse anyway with “Module is in use”).

Find the workloads:

$ kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select(any(.spec.volumes[]?; (.hostPath.path // "") | startswith("/dev/tenstorrent")))
  | "\(.metadata.namespace)/\(.metadata.name) on \(.spec.nodeName)"'

Or take the holder PIDs from the log, open a kubectl debug node/<node> session, and run chroot /host ps -fp <pid> to see which process each PID belongs to.

Drain those pods (kubectl delete pod for bare pods, kubectl scale deploy --replicas=0 for Deployments). The next reconcile will see refcnt=0 and proceed.
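
To pull the holder PIDs out of the builder log in one step, something like the following may help. It is a minimal sketch that assumes the log keeps the "Holders: <pid> <pid> ..." line format shown above; the function name holder_pids is illustrative and not part of the operator.

```shell
# Hypothetical helper: read a builder pod log on stdin and print one
# holder PID per line. Assumes the "Holders: <pid> <pid> ..." format
# from the log excerpt above.
holder_pids() {
  sed -n 's/^Holders:[[:space:]]*//p' | tr -s ' ' '\n' | grep -E '^[0-9]+$'
}
```

Typical use would be to pipe kubectl -n tt-k8s-driver-manager-system logs ttdrv-default-... into holder_pids and feed each PID to ps -fp on the node.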

/bin/sh: 1: gcc-12: not found

warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-13 ...
  You are using:           ...
/bin/sh: 1: gcc-12: not found
make[2]: *** [/tmp/.../module.o] Error 127

The kernel headers’ Makefile hardcodes a specific gcc-<version> binary name (matching the gcc the kernel was compiled with). The builder image ships gcc-12 by default — fine for Ubuntu 22.04 jammy + HWE 6.x kernels. If your host’s kernel was compiled with a different gcc, the build dies.

Check the host’s compiler:

$ cat /proc/version
Linux version 6.x.0-... (Ubuntu gcc-13 ...)

Two fixes:

  1. Use a builder image built with the matching gcc. Change gcc-12 to gcc-N in images/driver-build/Dockerfile, rebuild, push, and point driver.image at the new tag.

  2. Use a host kernel that matches the builder image. For example, apt install linux-image-generic-hwe-22.04 pulls a jammy kernel built with gcc-12.

Long-term, the builder should detect the kernel’s compiler at runtime and apt install the right gcc — open TODO.
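
The runtime detection mentioned in that TODO could start from /proc/version. The sketch below is an assumption about how that might look, not the builder's code. kernel_gcc_major is a made-up name, and it only handles the gcc-<N> spelling shown above; older kernels that print "gcc version 9.3.0" would need a second pattern.

```shell
# Hypothetical helper: print the major version of the gcc that built the
# kernel, given the contents of /proc/version as $1.
# Prints nothing if no "gcc-<N>" token is present.
kernel_gcc_major() {
  printf '%s\n' "$1" | grep -oE 'gcc-[0-9]+' | head -n1 | cut -d- -f2
}
```

On a node this would be called as kernel_gcc_major "$(cat /proc/version)" and the result compared against the gcc-12 the builder image ships.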

host has no kernel build tree at /lib/modules/<kver>/build

ERROR: host has no kernel build tree at /lib/modules/6.8.0-111-generic/build
Install linux-headers-6.8.0-111-generic on the host.

The builder needs the host’s kernel headers to compile against. The host doesn’t have them installed (or has the wrong ones — e.g. headers for an older kernel after a kernel upgrade + reboot).

sudo apt install linux-headers-$(uname -r)
# Or, on HWE:
sudo apt install linux-headers-generic-hwe-22.04

After that, restart the failing pod:

kubectl -n tt-k8s-driver-manager-system delete pod -l app.kubernetes.io/component=driver \
  --field-selector spec.nodeName=<that-node>
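
Before restarting the pod, it can be worth confirming on the host that the build tree now resolves. A minimal sketch, assuming only the /lib/modules/<kver>/build path from the error message; the function name and the optional root-prefix argument are illustrative.

```shell
# Hypothetical check: succeed if the kernel build tree for version $1 exists.
# $2 is an optional root prefix (e.g. /host inside a debug pod).
# -d follows symlinks, so a dangling build symlink counts as missing.
has_kernel_build_tree() {
  kver="$1"
  root="${2:-}"
  [ -d "$root/lib/modules/$kver/build" ]
}
```

For example: has_kernel_build_tree "$(uname -r)" || echo "headers still missing".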

NFD label missing — node not picked up

$ kubectl get nodes -L feature.node.kubernetes.io/pci-1200_1e52.present
NAME      STATUS   PRESENT
node-1   Ready

(empty, but the node has a Tenstorrent card)

Causes:

  1. NFD worker not running on that node. Check kubectl get pod -l app.kubernetes.io/name=node-feature-discovery -A. The worker pod should be Running on every Tenstorrent node.

  2. NFD’s PCI source disabled. Check the NFD chart’s worker.config.core.featureSources — must include pci. The tt-operator umbrella sets this to ["pci"] (only); a separate NFD install might have it disabled.

  3. NFD’s deviceClassWhitelist excludes class 12. The default whitelist includes 12, which covers class 1200. If you narrowed it, add 12 back.

If you want to bypass NFD entirely (dev clusters, kind), disable the label requirement on the controller:

helm upgrade tt-k8s-driver-manager ... --set controller.requireTenstorrentLabel=false

The DaemonSet’s nodeAffinity will still require the label though, so also kubectl label node <name> feature.node.kubernetes.io/pci-1200_1e52.present=true on dev nodes. There’s a hack/dev/label-fake-tt-nodes.yaml for kind.
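
To confirm that a node really has a Tenstorrent card before chasing NFD, the PCI vendor ID can be read straight from sysfs. The 1e52 in the label is the PCI vendor ID. The sketch below assumes the standard /sys/bus/pci/devices/<addr>/vendor layout; the function name and the overridable directory argument are illustrative.

```shell
# Hypothetical check: succeed if any PCI device under $1 (default
# /sys/bus/pci/devices) reports vendor ID 0x1e52.
has_tenstorrent_pci() {
  root="${1:-/sys/bus/pci/devices}"
  for dev in "$root"/*; do
    [ -r "$dev/vendor" ] || continue
    if [ "$(cat "$dev/vendor")" = "0x1e52" ]; then
      return 0
    fi
  done
  return 1
}
```

If this succeeds on the host but the label is still empty, the problem is on the NFD side (one of the three causes above) rather than the hardware.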

Module loaded but pod NotReady

$ kubectl -n tt-k8s-driver-manager-system get pod -l app.kubernetes.io/component=driver
ttdrv-default-...   0/1     Running   0   30s

The pod’s readiness probe checks that /sys/module/tenstorrent/version equals $TT_KMD_VERSION. If the loaded version doesn’t match what the CR wants:

  • Pod is honest: it’s NotReady because the kernel isn’t at the declared state.

  • Most common cause: rmmod failed earlier (refcnt > 0) so the new .ko was built and insmod was called, but insmod failed silently because the module is already loaded with the same name.

Check the pod’s logs for the actual error. Then either drain workloads (refcnt → 0) or reboot the host.
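
The same comparison can be run by hand on the node to see which version is actually loaded. This is a minimal sketch of the check as described above, not the probe's actual source; the function name, the return codes, and the overridable file argument are assumptions.

```shell
# Hypothetical re-implementation of the readiness comparison.
# $1: wanted tt-kmd version; $2: version file (defaults to the sysfs path).
# Returns 0 on match, 1 on mismatch, 2 if the module is not loaded
# (the version file is absent or unreadable).
kmd_version_matches() {
  want="$1"
  version_file="${2:-/sys/module/tenstorrent/version}"
  [ -r "$version_file" ] || return 2
  have="$(cat "$version_file")"
  [ "$have" = "$want" ]
}
```

For example: kmd_version_matches 2.8.0 || cat /sys/module/tenstorrent/version.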

Host has tt-kmd from DKMS/apt; operator ignored it

$ kubectl get node node-1 -L driver.tenstorrent.com/install-mode
NAME     INSTALL-MODE
node-1   host

Expected, not a bug. The builder pod detected /var/lib/dkms/tenstorrent or /usr/src/tenstorrent-<v>/dkms.conf and stood down. The host’s tt-kmd stays, the operator doesn’t rmmod or rebuild.

If you want the operator to take over, follow Migrating from DKMS — it has the per-node vacate script, the cluster-side cordon/drain coordination, and the watch-outs (both DKMS signal dirs, refcnt > 0, proxied builder).

If you want to keep the host install and just stop driver-manager from touching the node entirely:

kubectl label node <name> driver.tenstorrent.com/skip=true

tt-smi can’t execute on host

$ tt-smi -s
tt-smi: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found

/usr/local/bin/tt-smi is the self-contained binary from tt-smi’s GitHub releases, which ships one flavor per Ubuntu release (tt-smi-<v>-ubuntu-22.04, -ubuntu-24.04, …). The builder image bakes in the flavor matching its ARG UBUNTU_VERSION; if that’s newer than the host OS, the binary needs glibc symbols the host doesn’t have.

Fix: set ARG UBUNTU_VERSION in images/driver-build/Dockerfile to the hosts’ Ubuntu release, then rebuild and push the image. Mixed-OS fleets need one builder image (and so one TenstorrentDriverPolicy with a matching nodeAffinity) per Ubuntu release.
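
To check whether a host can run a given tt-smi flavor, compare the host's glibc with the version named in the error. For reference, Ubuntu 22.04 ships glibc 2.35 and 24.04 ships 2.39. The helpers below are an illustrative sketch; the names are made up, and version_ge relies on GNU sort -V.

```shell
# Hypothetical helper: succeed if version $1 >= version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Hypothetical helper: print the host glibc version, e.g. "2.35".
# getconf GNU_LIBC_VERSION prints "glibc <version>" on glibc systems.
host_glibc_version() {
  getconf GNU_LIBC_VERSION | awk '{print $2}'
}
```

For the error above: version_ge "$(host_glibc_version)" 2.38 || echo "host glibc too old for this tt-smi build".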

Builder pod can’t clone tt-kmd — proxy / DNS

$ kubectl -n tt-k8s-driver-manager-system logs ttdrv-default-...
fatal: unable to access 'https://github.com/tenstorrent/tt-kmd.git/':
  Could not resolve host: github.com

Builder pod has no network path to GitHub for the cache-miss clone. Common when pod egress goes through a proxy (CI behind squid, isolated clusters) and HTTPS_PROXY isn’t set on the builder.

The controller propagates its own HTTPS_PROXY / HTTP_PROXY / NO_PROXY env to spawned builder pods, so the fix is usually on the controller side. Check that the controller pod itself has the proxy env set:

kubectl -n tt-k8s-driver-manager-system get deploy \
  tt-k8s-driver-manager-controller -o jsonpath='{.spec.template.spec.containers[0].env}' \
  | jq '.[] | select(.name | test("PROXY"))'

If empty, set via Helm:

# Quote each --set so the shell leaves [0] alone; escape commas so helm does not split the value.
helm upgrade tt-k8s-driver-manager ... \
  --set 'controller.extraEnv[0].name=HTTPS_PROXY' \
  --set 'controller.extraEnv[0].value=http://proxy.internal:3128' \
  --set 'controller.extraEnv[1].name=NO_PROXY' \
  --set 'controller.extraEnv[1].value=10.0.0.0/8\,.svc\,.svc.cluster.local'

The controller restart re-templates the builder DaemonSet with the new env; existing pods need a delete to re-roll.

Policy never matches — MessageExternalCordon on the CR

$ kubectl describe ttfp <name>
...
Status:
  Per Node:
    Name:     node-1
    State:    Pending
    Message:  node is cordoned but not by this operator

The node has spec.unschedulable: true but no firmware.tenstorrent.com/cordoned-by=<policy-name> (or the driver equivalent driver.tenstorrent.com/cordoned-by) annotation. The controller refuses to flash a node it didn’t cordon itself — protects SREs who’ve taken nodes out of rotation for unrelated reasons.

Either uncordon and let the controller cordon it itself:

kubectl uncordon <node>

…or claim the existing cordon by stamping the annotation:

kubectl annotate node <node> \
  firmware.tenstorrent.com/cordoned-by=<policy-name>

(Same pattern for driver policies, with driver.tenstorrent.com/cordoned-by.)

Alternative: set upgradePolicy.drain.enable: false on the CR to skip the cordon gate entirely. The flash Job will then land via its universal toleration regardless of cordon state, at the cost of the device-pod-eviction safety.

Flash didn’t re-run after editing spec.flasher.image

You bumped TenstorrentFirmwarePolicy.spec.flasher.image (or forceWrite, imagePullPolicy, etc.) and the controller did nothing.

Known limitation: the per-node flash Job name is hashed on (CR name, node name, kmd version). The flasher fields are NOT in the hash. If a Job already exists at that name with Completed status, the controller treats the node as Done.

Workarounds until #42 ships a fix:

  • Bump spec.version (forces a new Job name).

  • Delete the per-node Job: kubectl -n tt-k8s-driver-manager-system delete job -l driver.tenstorrent.com/cr=<name>,driver.tenstorrent.com/node=<node>.

  • Wait out the 24h Job TTL — the next reconcile will spawn a fresh Job that picks up the new flasher fields.
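
To see why editing the flasher fields changes nothing, consider a name derived only from the three hashed inputs. The scheme below is purely hypothetical (the prefix, separator, and sha256 choice are assumptions, not the controller's algorithm); it only illustrates that fields outside the hash cannot produce a new Job name.

```shell
# Hypothetical naming scheme, for illustration only.
# Inputs: CR name, node name, version. The flasher image is deliberately
# not an input, mirroring the limitation described above.
flash_job_name() {
  cr="$1"; node="$2"; version="$3"
  digest="$(printf '%s/%s/%s' "$cr" "$node" "$version" | sha256sum | cut -c1-10)"
  printf 'flash-%s\n' "$digest"
}
```

With this scheme, the same CR, node, and version always map to the same name, while a changed version maps to a different one, which is why bumping spec.version works as a workaround.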

Multiple policies match the same node — MessageNodeConflict

$ kubectl describe ttfp <name>
...
    Message:  node is also matched by another firmware policy

Two TenstorrentFirmwarePolicy (or two TenstorrentDriverPolicy) CRs have overlapping nodeSelectors. The controller refuses to flash rather than let the two policies race each other.

Find the offenders:

kubectl get ttfp -o json | jq -c '
  .items[] | {name: .metadata.name, selector: .spec.nodeSelector}'

Narrow one selector so each node matches exactly one CR. Typical mistake: a “wildcard” CR (empty nodeSelector) sitting next to a team-scoped CR. Add a matchExpressions NotIn on the wildcard or delete it.

CR keeps re-flashing despite node being at the right version

The firmware controller’s “this node is done” signal is a Complete Job in its own namespace. If you (a) moved the controller to a new namespace, or (b) deleted all old flash Jobs, the controller has no record that the node was flashed and re-runs.

Pre-flash readback (inside the flasher Job) will report the existing version matches the desired, and tt-flash with --force=false will no-op. So it’s harmless, just slow. The node’s current-version annotation is the cheaper source of truth — bumping the controller to read it as a tiebreaker is open work.

Controller is in a tight reconcile loop

kubectl logs -n tt-k8s-driver-manager-system deploy/tt-k8s-driver-manager-controller shows “updated tt-kmd DaemonSet” multiple times per second.

Old symptom (fixed): the diff predicate compared image but missed other template fields. If you see this on a current version, file a bug — the driver.tenstorrent.com/template-hash annotation on the DS should be making this impossible.

Fully clean a host

When a node has been managed by both this operator and a host-side DKMS installer (or earlier non-container installer pods) and you want to start fresh:

# As root on the node:
rmmod tenstorrent                                      # may fail if refcnt>0
rm -rf /usr/src/tenstorrent-*
rm -rf /var/lib/dkms/tenstorrent
rm -f /lib/modules/*/updates/dkms/tenstorrent.ko*
rm -f /etc/modules-load.d/tenstorrent.conf
rm -rf /opt/tt
rm -f /usr/local/bin/tt-smi
rm -rf /var/cache/tt-kmd/
for kdir in /lib/modules/*/; do depmod -a "$(basename "$kdir")"; done

Reboot the node afterwards, which also clears a module that rmmod could not unload. After the reboot, the host should have no trace of tt-kmd, and the builder pod will fall through to its build path on first reconcile.

This can also be scripted via kubectl debug node/<n> -it --image=<image> --profile=sysadmin -- chroot /host bash -c '<script>' so you don’t need SSH access. kubectl debug requires --image; any image that provides chroot works.
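
To verify the cleanup, a quick scan for the directories and files removed above can help. A minimal sketch: the function name and the optional root-prefix argument are illustrative, and it does not check the per-kernel tenstorrent.ko* files under /lib/modules.

```shell
# Hypothetical check: print any tt-kmd leftovers that still exist.
# $1 is an optional root prefix (e.g. /host inside a debug pod).
tt_kmd_leftovers() {
  root="${1:-}"
  for p in /var/lib/dkms/tenstorrent /etc/modules-load.d/tenstorrent.conf \
           /opt/tt /usr/local/bin/tt-smi /var/cache/tt-kmd; do
    [ -e "$root$p" ] && printf '%s\n' "$p"
  done
  # Versioned DKMS source directories; an unmatched glob stays literal
  # and fails the -e test.
  for p in "$root"/usr/src/tenstorrent-*; do
    [ -e "$p" ] && printf '%s\n' "${p#"$root"}"
  done
  return 0
}
```

Empty output means none of those paths remain.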

When all else fails

kubectl -n tt-k8s-driver-manager-system logs deploy/tt-k8s-driver-manager-controller --tail=100
kubectl -n tt-k8s-driver-manager-system get events --sort-by=.lastTimestamp | tail -20
kubectl -n tt-k8s-driver-manager-system describe ttdp <name>
kubectl describe node <name>

Most issues are visible in one of those four. File a bug with:

  • The CR’s full YAML (kubectl get ttdp <name> -o yaml)

  • Affected node’s labels (kubectl get node <name> -o yaml | grep -A20 labels)

  • The failing pod’s logs (current and --previous)

  • The events from the operator namespace.