Continuous Operations
Upgrade tt-operator
Upgrade the release in place with Helm:
helm upgrade tt-operator oci://ghcr.io/tenstorrent/helm/tt-operator \
  --namespace tt-operator-system --reuse-values
The Deployments of enabled controllers roll to the new version, and existing custom resource definitions are preserved.
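One way to confirm that the roll finished is to wait on each Deployment in the release namespace. The sketch below is an illustration, not part of the chart: the helper name wait_for_rollout and the 5-minute timeout are assumptions, and the namespace defaults to the one used above.

```shell
# Hypothetical helper: block until every Deployment in the namespace has
# finished rolling out. The 5m timeout is an illustrative default.
wait_for_rollout() {
  local ns="${1:-tt-operator-system}" deploys d
  # -o name prints one "deployment.apps/<name>" entry per line.
  deploys=$(kubectl -n "$ns" get deploy -o name) || return 1
  for d in $deploys; do
    kubectl -n "$ns" rollout status "$d" --timeout=5m || return 1
  done
}
```

Call it as wait_for_rollout (or wait_for_rollout <namespace>) right after helm upgrade returns. A non-zero exit means at least one Deployment did not become ready within the timeout.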
Re-applying vendored CRDs
Some subcharts (currently JobSet) ship their CRDs out of band from the Helm
release, and the umbrella chart vendors those CRDs. Helm applies them on install
only. helm upgrade deliberately skips them; this is the Helm 3 convention that
prevents a chart from silently changing CRD schemas under live resources. When
upgrading to a chart version that bumps a subchart owning vendored CRDs,
re-apply them yourself before running the upgrade:
helm pull oci://ghcr.io/tenstorrent/helm/tt-operator --version <new> \
  --untar -d /tmp/tt-operator-pull
kubectl apply --server-side --force-conflicts -f /tmp/tt-operator-pull/tt-operator/crds/
helm upgrade tt-operator oci://ghcr.io/tenstorrent/helm/tt-operator --version <new> \
  -n tt-operator-system --reuse-values
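--server-side is required because the JobSet CRD schema exceeds the
client-side apply annotation limit: client-side apply stores the whole object in
the kubectl.kubernetes.io/last-applied-configuration annotation, and annotations
are capped at 262144 bytes. --force-conflicts lets the apply take ownership of
fields that another field manager currently owns.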
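Before re-applying, it can help to preview what the new CRDs would change on the live cluster. The sketch below wraps kubectl diff, which exits 0 when nothing differs, 1 when differences exist, and with a higher code on error. The helper name preview_crds is an assumption, and the default path is the pull directory used above.

```shell
# Hypothetical helper: report whether the vendored CRDs differ from the
# cluster, without changing anything.
preview_crds() {
  local dir="${1:-/tmp/tt-operator-pull/tt-operator/crds}" rc=0
  # kubectl diff: 0 = no differences, 1 = differences, >1 = error.
  kubectl diff --server-side --force-conflicts -f "$dir" || rc=$?
  case "$rc" in
    0) echo "no CRD changes" ;;
    1) echo "CRD changes pending" ;;
    *) echo "kubectl diff failed (exit $rc)" >&2; return 2 ;;
  esac
}
```

Run preview_crds after helm pull and before kubectl apply. If it reports pending changes, review the diff output above the summary line before you proceed.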
Upgrade the driver
Driver version transitions are driven by the TenstorrentDriverPolicy, not by a
chart upgrade. Change spec.version and re-apply. With drain enabled, the
operator cordons and drains the node, rebuilds and reloads tt-kmd, then
uncordons it. Driver upgrades also pause telemetry first: the controller sets the
tenstorrent.com/deploy.tt-telemetry node gate to drain the collector so that the
collector releases the device, then restores it once the new driver is ready.
See the Driver Manager component page.
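As an alternative to editing the manifest and re-applying it, spec.version can be changed with a merge patch. This is a sketch under assumptions: the helper name set_driver_version is hypothetical, the policy name is whatever your cluster uses, and no namespace flag is passed, mirroring the delete commands in the Uninstall section. Add -n if your policy is namespaced.

```shell
# Hypothetical helper: set spec.version on an existing
# TenstorrentDriverPolicy with a JSON merge patch.
set_driver_version() {
  if [ "$#" -ne 2 ]; then
    echo "usage: set_driver_version <policy-name> <version>" >&2
    return 1
  fi
  local policy="$1" version="$2"
  kubectl patch tenstorrentdriverpolicies "$policy" --type merge \
    -p "{\"spec\":{\"version\":\"$version\"}}"
}
```

After patching, kubectl get nodes -w shows nodes moving through SchedulingDisabled and back as the operator cordons and uncordons them, provided drain is enabled.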
Uninstall
Remove any policy custom resources first, then uninstall the release:
kubectl delete tenstorrentdriverpolicies --all
kubectl delete tenstorrentfirmwarepolicies --all
helm uninstall tt-operator -n tt-operator-system
helm uninstall removes the operands, including the controllers, DaemonSets, and
telemetry. By Helm convention the custom resource definitions are not removed on
uninstall. Delete them explicitly if you want them gone. Deleting a CRD also
deletes every remaining custom resource of that kind:
kubectl delete crd tenstorrentdriverpolicies.driver.tenstorrent.com \
  tenstorrentfirmwarepolicies.firmware.tenstorrent.com
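To check whether any operator CRDs are still registered, you can filter the CRD list by API group suffix. The helper name remaining_tt_crds is an assumption, and the filter only matches groups ending in .tenstorrent.com, as in the two CRD names above.

```shell
# Hypothetical helper: print CRDs whose API group ends in .tenstorrent.com.
# Prints nothing, and still exits 0, when none remain.
remaining_tt_crds() {
  kubectl get crd -o name | grep '\.tenstorrent\.com$' || true
}
```

Empty output from remaining_tt_crds means no CRD with a tenstorrent.com group is left on the cluster.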
Collect diagnostics
When something looks wrong, capture the cluster-wide pod list and the operator namespace state before tearing anything down:
kubectl get pods -A -o wide
kubectl -n tt-operator-system describe pods
kubectl -n tt-operator-system get ds,deploy,sa
kubectl get events -n tt-operator-system --sort-by=.lastTimestamp | tail -50
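The commands above can be wrapped so that their output lands in files you can attach to a bug report. The sketch below is one possible arrangement: the helper name collect_diagnostics, the file names, and the timestamped default directory are all assumptions.

```shell
# Hypothetical helper: save the diagnostic command output into a directory
# and print the directory path.
collect_diagnostics() {
  local ns="${1:-tt-operator-system}"
  local out="${2:-tt-diag-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$out" || return 1
  # Capture stderr too, so RBAC or connectivity errors end up in the bundle.
  kubectl get pods -A -o wide > "$out/pods.txt" 2>&1
  kubectl -n "$ns" describe pods > "$out/describe-pods.txt" 2>&1
  kubectl -n "$ns" get ds,deploy,sa > "$out/workloads.txt" 2>&1
  kubectl get events -n "$ns" --sort-by=.lastTimestamp > "$out/events.txt" 2>&1
  echo "$out"
}
```

Running collect_diagnostics with no arguments writes to a new tt-diag-<timestamp> directory under the current working directory. Unlike the interactive command, events are saved in full rather than trimmed to the last 50 lines.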
See Troubleshooting for how to read the common failure signals.