# Troubleshooting

Common failures, ordered by how often they bite. Each entry has the symptom you see in `kubectl`, the root cause, and the fix.

## ImagePullBackOff on the driver/flasher pods

```
NAME                READY   STATUS             RESTARTS   AGE
ttdrv-default-...   0/1     ImagePullBackOff   0          5m
```

```bash
$ kubectl -n tt-k8s-driver-manager-system describe pod ttdrv-...
...
Failed to pull image ... HTTP 4xx / network unreachable ...
```

Image registries are public, so this is almost always one of:

- **Egress blocked** to `ghcr.io` or `pkg-containers.githubusercontent.com`. Check your cluster's proxy / firewall allowlist.
- **Wrong image tag** in the policy or chart values. Run `docker pull :` from a workstation to confirm the tag exists.
- **Pull secret left over** from a previous private-registry setup that no longer authenticates. Drop the secret from the ServiceAccount: `kubectl -n tt-k8s-driver-manager-system patch sa tt-k8s-driver-manager-installer --type=json -p='[{"op":"remove","path":"/imagePullSecrets"}]'`.

## "Pod is in use; cannot reinstall" (refcnt > 0)

```
$ kubectl -n tt-k8s-driver-manager-system logs ttdrv-default-...
ERROR: tt-kmd 2.7.0 loaded with refcnt > 0; cannot reinstall 2.8.0
Holders: 12345 23456
Drain workloads holding /dev/tenstorrent and let the next reconcile retry.
```

A workload has `/dev/tenstorrent/*` open. The builder refuses to `rmmod` (the kernel would refuse anyway with `Module is in use`).

Find the workloads:

```bash
$ kubectl get pods --all-namespaces -o json | jq -r '
    .items[]
    | select(.spec.volumes[]? | .hostPath.path? == "/dev/tenstorrent")
    | "\(.metadata.namespace)/\(.metadata.name) on \(.spec.nodeName)"'
```

Or use the PIDs from the log + `kubectl debug node/X` to `chroot /host ps -fp `.

Drain those pods (`kubectl delete pod` for bare pods, `kubectl scale deploy --replicas=0` for Deployments). The next reconcile will see `refcnt=0` and proceed.
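Before waiting for the next reconcile, it can help to confirm on the node that the reference count really dropped. The sketch below assumes the kernel exposes `/sys/module/tenstorrent/refcnt` (the usual sysfs location for loadable modules when module unloading is enabled); `tt_module_state` is a hypothetical helper name, not part of the operator.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report whether the tenstorrent module is still held open.
# The sysfs directory is a parameter so the function can be pointed at
# /host/sys/module/tenstorrent from a debug pod, or at a scratch dir in tests.
tt_module_state() {
  local sysfs_dir="${1:-/sys/module/tenstorrent}"
  local refcnt_file="${sysfs_dir}/refcnt"

  # No refcnt file: the module is not loaded (or unloading is not supported).
  if [ ! -r "$refcnt_file" ]; then
    echo "not-loaded"
    return 0
  fi

  local refcnt
  refcnt="$(cat "$refcnt_file")"
  if [ "$refcnt" -gt 0 ]; then
    # Something still has the device open; rmmod would fail.
    echo "busy refcnt=${refcnt}"
    return 1
  fi

  echo "idle refcnt=0"
  return 0
}
```

On a node (or inside `chroot /host` from a debug pod), calling `tt_module_state` with no argument reads the live value; a non-zero exit status means workloads still need draining.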
## `/bin/sh: 1: gcc-12: not found`

```
warning: the compiler differs from the one used to build the kernel
The kernel was built by: x86_64-linux-gnu-gcc-13 ...
You are using: ...
/bin/sh: 1: gcc-12: not found
make[2]: *** [/tmp/.../module.o] Error 127
```

The kernel headers' Makefile hardcodes a specific `gcc-` binary name (matching the gcc the kernel was compiled with). The builder image ships gcc-12 by default, which is fine for Ubuntu 22.04 (jammy) with HWE 6.x kernels. If your host's kernel was compiled with a different gcc, the build dies.

Check the host's compiler:

```bash
$ cat /proc/version
Linux version 6.x.0-... (Ubuntu gcc-13 ...)
```

Two fixes:

1. **Use a builder image built with the matching gcc.** Bump `gcc-12` → `gcc-N` in `images/driver-build/Dockerfile`, rebuild, push, point `driver.image` at the new tag.
2. **Use a host with a different kernel.** Match the kernel to the builder image: `apt install linux-image-generic-hwe-22.04` pulls a jammy/gcc-12 kernel.

Long-term, the builder should detect the kernel's compiler at runtime and `apt install` the right gcc (open TODO).

## `host has no kernel build tree at /lib/modules//build`

```
ERROR: host has no kernel build tree at /lib/modules/6.8.0-111-generic/build
Install linux-headers-6.8.0-111-generic on the host.
```

The builder needs the host's kernel headers to compile against. The host doesn't have them installed, or has the wrong ones (e.g. headers for an older kernel after a kernel upgrade + reboot).

```bash
sudo apt install linux-headers-$(uname -r)
# Or, on HWE:
sudo apt install linux-headers-generic-hwe-22.04
```

After that, restart the failing pod:

```bash
kubectl -n tt-k8s-driver-manager-system delete pod -l app.kubernetes.io/component=driver \
  --field-selector spec.nodeName=
```

## NFD label missing: node not picked up

```bash
$ kubectl get nodes -L feature.node.kubernetes.io/pci-1200_1e52.present
NAME     STATUS   PRESENT
node-1   Ready
```

(empty, but the node has a Tenstorrent card)

Causes:

1. **NFD worker not running on that node.** Check `kubectl get pod -l app.kubernetes.io/name=node-feature-discovery -A`. The worker pod should be `Running` on every Tenstorrent node.
2. **NFD's PCI source disabled.** Check the NFD chart's `worker.config.core.featureSources`; it must include `pci`. The tt-operator umbrella sets this to `["pci"]` (only); a separate NFD install might have it disabled.
3. **NFD's deviceClassWhitelist excludes class 12.** Class 1200 is in the default whitelist. If you narrowed it, add class 12 back.

If you want to bypass NFD entirely (dev clusters, kind), set on the controller deployment:

```bash
helm upgrade tt-k8s-driver-manager ... --set controller.requireTenstorrentLabel=false
```

The DaemonSet's `nodeAffinity` will still require the label though, so also `kubectl label node feature.node.kubernetes.io/pci-1200_1e52.present=true` on dev nodes. There's a `hack/dev/label-fake-tt-nodes.yaml` for kind.

## Module loaded but pod NotReady

```bash
$ kubectl -n tt-k8s-driver-manager-system get pod -l app.kubernetes.io/component=driver
ttdrv-default-...   0/1   Running   0   30s
```

The pod's readiness probe checks `/sys/module/tenstorrent/version == $TT_KMD_VERSION`. If the loaded version doesn't match what the CR wants:

- The pod is honest: it's NotReady because the kernel isn't at the declared state.
- Most common cause: `rmmod` failed earlier (refcnt > 0), so the new `.ko` was built and `insmod` was called, but `insmod` failed silently because a module with the same name is already loaded.

Check the pod's logs for the actual error. Then either drain workloads (refcnt → 0) or reboot the host.

## Host has tt-kmd from DKMS/apt; operator ignored it

```bash
$ kubectl get node node-1 -L driver.tenstorrent.com/install-mode
NAME     INSTALL-MODE
node-1   host
```

Expected, not a bug. The builder pod detected `/var/lib/dkms/tenstorrent` or `/usr/src/tenstorrent-/dkms.conf` and stood down. The host's tt-kmd stays; the operator doesn't `rmmod` or rebuild.
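To see which of the two DKMS signals a node trips, the check can be approximated by hand. This is a rough sketch of the detection described above, not the builder's actual code: `host_has_dkms_tt_kmd` is a hypothetical name, and the `tenstorrent-*` glob is an assumption about how the versioned source directory is named.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the host shows either DKMS signal.
# The host root is a parameter: "" on the node itself, "/host" from a
# `kubectl debug node/...` pod, or a scratch directory in tests.
host_has_dkms_tt_kmd() {
  local root="${1:-}"

  # Signal 1: DKMS state directory for the module.
  if [ -d "${root}/var/lib/dkms/tenstorrent" ]; then
    return 0
  fi

  # Signal 2: a DKMS source tree with a dkms.conf (assumed glob pattern).
  # If nothing matches, the pattern stays literal and the -f test fails.
  local conf
  for conf in "${root}"/usr/src/tenstorrent-*/dkms.conf; do
    if [ -f "$conf" ]; then
      return 0
    fi
  done

  return 1
}
```

For example, `host_has_dkms_tt_kmd /host && echo "host install detected"` from a node debug pod shows whether the operator would be expected to stand down on that node.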
If you want the operator to take over, follow [Migrating from DKMS](migrating-from-dkms.md). It has the per-node vacate script, the cluster-side cordon/drain coordination, and the watch-outs (both DKMS signal dirs, refcnt > 0, proxied builder).

If you want to keep the host install and just stop driver-manager from touching the node entirely:

```bash
kubectl label node driver.tenstorrent.com/skip=true
```

## tt-smi can't execute on host

```
$ tt-smi -s
tt-smi: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.38' not found
```

`/usr/local/bin/tt-smi` is the self-contained binary from tt-smi's GitHub releases, which ships one flavor per Ubuntu release (`tt-smi--ubuntu-22.04`, `-ubuntu-24.04`, ...). The builder image bakes in the flavor matching its `ARG UBUNTU_VERSION`; if that's newer than the host OS, the binary needs glibc symbols the host doesn't have.

Fix: set `ARG UBUNTU_VERSION` in `images/driver-build/Dockerfile` to the hosts' Ubuntu release and push. Mixed-OS fleets need one builder image (and so one `TenstorrentDriverPolicy` with a matching `nodeAffinity`) per Ubuntu release.

## Builder pod can't clone tt-kmd: proxy / DNS

```
$ kubectl -n tt-k8s-driver-manager-system logs ttdrv-default-...
fatal: unable to access 'https://github.com/tenstorrent/tt-kmd.git/': Could not resolve host: github.com
```

The builder pod has no network path to GitHub for the cache-miss clone. This is common when pod egress goes through a proxy (CI behind squid, isolated clusters) and `HTTPS_PROXY` isn't set on the builder.

The controller propagates its own `HTTPS_PROXY` / `HTTP_PROXY` / `NO_PROXY` env to spawned builder pods, so the fix is usually on the controller side. Check that the controller pod itself has the proxy env set:

```bash
kubectl -n tt-k8s-driver-manager-system get deploy \
  tt-k8s-driver-manager-controller -o jsonpath='{.spec.template.spec.containers[0].env}' \
  | jq '.[] | select(.name | test("PROXY"))'
```

If empty, set via Helm:

```bash
# Quote the keys (some shells glob on [0]) and escape commas in the value:
# helm's --set splits on unescaped commas.
helm upgrade tt-k8s-driver-manager ... \
  --set 'controller.extraEnv[0].name=HTTPS_PROXY' \
  --set 'controller.extraEnv[0].value=http://proxy.internal:3128' \
  --set 'controller.extraEnv[1].name=NO_PROXY' \
  --set 'controller.extraEnv[1].value=10.0.0.0/8\,.svc\,.svc.cluster.local'
```

The controller restart re-templates the builder DaemonSet with the new env; existing pods need a delete to re-roll.

## Policy never matches: `MessageExternalCordon` on the CR

```bash
$ kubectl describe ttfp ...
Status:
  Per Node:
    Name:     node-1
    State:    Pending
    Message:  node is cordoned but not by this operator
```

The node has `spec.unschedulable: true` but no `firmware.tenstorrent.com/cordoned-by=` (or the driver equivalent `driver.tenstorrent.com/cordoned-by`) annotation. The controller refuses to flash a node it didn't cordon itself, which protects SREs who've taken nodes out of rotation for unrelated reasons.

Either uncordon and let the controller cordon it itself:

```bash
kubectl uncordon
```

…or claim the existing cordon by stamping the annotation:

```bash
kubectl annotate node \
  firmware.tenstorrent.com/cordoned-by=
```

(Same pattern for driver policies, with `driver.tenstorrent.com/cordoned-by`.)

Alternative: set `upgradePolicy.drain.enable: false` on the CR to skip the cordon gate entirely. The flash Job will land via its universal toleration regardless of cordon state. This loses the device-pod-eviction safety.

## Flash didn't re-run after editing `spec.flasher.image`

You bumped `TenstorrentFirmwarePolicy.spec.flasher.image` (or `forceWrite`, `imagePullPolicy`, etc.) and the controller did nothing.

Known limitation: the per-node flash Job name is hashed on `(CR name, node name, kmd version)`. The flasher fields are NOT in the hash. If a Job already exists at that name with `Completed` status, the controller treats the node as Done.

Workarounds until #42 ships a fix:

- Bump `spec.version` (forces a new Job name).
- Delete the per-node Job: `kubectl -n tt-k8s-driver-manager-system delete job -l driver.tenstorrent.com/cr=,driver.tenstorrent.com/node=`.
- Wait out the 24h Job TTL. The next reconcile will spawn a fresh Job that picks up the new flasher fields.

## Multiple policies match the same node: `MessageNodeConflict`

```bash
$ kubectl describe ttfp ...
Message: node is also matched by another firmware policy
```

Two `TenstorrentFirmwarePolicy` (or two `TenstorrentDriverPolicy`) CRs have overlapping `nodeSelector`s. The controller refuses to flash so the policies don't race each other.

Find the offenders:

```bash
kubectl get ttfp -o json | jq -r '
  .items[] | {name: .metadata.name, selector: .spec.nodeSelector}'
```

Narrow one selector so each node matches exactly one CR. Typical mistake: a "wildcard" CR (empty `nodeSelector`) sitting next to a team-scoped CR. Add a `matchExpressions` `NotIn` on the wildcard or delete it.

## CR keeps re-flashing despite node being at the right version

The firmware controller's "this node is done" signal is a `Complete` Job in its own namespace. If you (a) moved the controller to a new namespace, or (b) deleted all old flash Jobs, the controller has no record that the node was flashed and re-runs.

Pre-flash readback (inside the flasher Job) will report that the existing version matches the desired one, and tt-flash with `--force=false` will no-op. So it's harmless, just slow.

The node's `current-version` annotation is the cheaper source of truth; bumping the controller to read it as a tiebreaker is open work.

## Controller is in a tight reconcile loop

`kubectl logs -n tt-k8s-driver-manager-system deploy/tt-k8s-driver-manager-controller` shows "updated tt-kmd DaemonSet" multiple times per second.

Old symptom (fixed): the diff predicate compared image but missed other template fields. If you see this on a current version, file a bug. The `driver.tenstorrent.com/template-hash` annotation on the DS should be making this impossible.
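To put a number on "multiple times per second" before filing the bug, one option is to count the message over a fixed log window. A minimal sketch, assuming the log line contains the literal text quoted above; `count_ds_updates` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count DaemonSet-update log lines read from stdin.
# grep -c exits non-zero when the count is 0, so `|| true` keeps the
# function usable under `set -e`; the count itself is still printed.
count_ds_updates() {
  grep -c -F 'updated tt-kmd DaemonSet' || true
}

# Usage on a live cluster (not run here): count updates in the last minute.
#   kubectl logs -n tt-k8s-driver-manager-system \
#     deploy/tt-k8s-driver-manager-controller --since=60s | count_ds_updates
```

A healthy controller should print a small number here; a count in the hundreds for a 60-second window matches the loop described above and is worth attaching to the bug report.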
## Fully clean a host

When a node has been managed by both this operator and a host-side DKMS installer (or earlier non-container installer pods) and you want to start fresh:

```bash
# As root on the node:
rmmod tenstorrent   # may fail if refcnt>0
rm -rf /usr/src/tenstorrent-*
rm -rf /var/lib/dkms/tenstorrent
rm -f /lib/modules/*/updates/dkms/tenstorrent.ko*
rm -f /etc/modules-load.d/tenstorrent.conf
rm -rf /opt/tt
rm -f /usr/local/bin/tt-smi
rm -rf /var/cache/tt-kmd/
for kdir in /lib/modules/*/; do depmod -a "$(basename $kdir)"; done
```

After reboot, the host should have no trace of tt-kmd, and the builder pod will fall through to its build path on first reconcile.

This can also be scripted via `kubectl debug node/ --profile=sysadmin -- chroot /host bash -c '