Troubleshooting

Start by capturing namespace state. See Collect diagnostics, then match the symptom below.

helm install fails with a cert-manager error

The error mentions no matches for kind "Issuer" or "Certificate". cert-manager is not installed, and the bundled kubepmix webhook needs it. Install cert-manager, or install without kubepmix:

--set kubepmix.enabled=false

helm install fails with a PodMonitor error

The error mentions no matches for kind "PodMonitor". The Prometheus Operator resources (monitoring.coreos.com) are not present, and tt-telemetry ships a PodMonitor by default. Disable it:

--set tt-telemetry.podMonitor.enabled=false

You can still scrape /metrics by other means. See Telemetry.

Pods stuck in ImagePullBackOff

The node cannot pull from ghcr.io. Confirm outbound registry access and, if your registry requires authentication, that a valid pull secret is configured. Inspect the failing pod:

kubectl -n tt-operator-system describe pod <pod>

NFD did not label a node

kubectl get nodes -l feature.node.kubernetes.io/pci-1200_1e52.present=true

If a node with a device is missing, confirm the device is visible on the host with lspci | grep -i tenstorrent, and that the NFD worker pod is Running on that node. Labeling is asynchronous, so allow a short interval after install.

No /dev/tenstorrent devices on a node

The driver is not loaded. Check, in order:

kubectl get tenstorrentdriverpolicies                       # is a policy applied?
kubectl -n tt-operator-system get ds,pods                   # is the per-policy builder DaemonSet running?
cat /sys/module/tenstorrent/version                          # on the node, is the module loaded?

A common root cause is the builder failing to compile tt-kmd because the node’s kernel headers are missing. Inspect the builder pod logs:

kubectl -n tt-operator-system logs <driver-builder-pod>

A ResourceClaim never binds (DRA)

kubectl get resourceslices

If there are no device entries, the DRA Driver has no resolvable fabric topology on the node, so it publishes nothing and the claim cannot bind. This is an environment limitation, not a fault.

Telemetry collector restarts during a driver install

This is expected. The device briefly disappears while tt-kmd is reinstalled and the collector restarts. /metrics becomes healthy again once the driver is back. See Telemetry.