Upgrades
Three independent things can be upgraded; the procedure is different for each.
tt-kmd
Bump spec.version on the TenstorrentDriverPolicy:
kubectl patch ttdp default --type merge -p '{"spec":{"version":"2.8.0"}}'
Per-node sequence (full state machine in Upgrade flow):
1. Controller cordons the node and flips controller.deployGates labels (default tenstorrent.com/deploy.tt-telemetry=false) so sibling DSes that hold /dev/tenstorrent evict themselves.
2. Two-pass drain: pass 1 evicts hostPath /dev/tenstorrent holders; pass 2 (drain.fullNode, default on) is full kubectl drain semantics.
3. Controller re-renders the DS template with the new TT_KMD_VERSION env and stamps a new template-hash annotation.
4. K8s rolling update: maxUnavailable: 1, i.e. one node at a time.
5. New pod’s entrypoint: checks refcnt; if 0, rmmod tenstorrent, build from cache or clone + make modules, insmod. upgradePolicy.forceUnload=true SIGKILLs remaining holders via /proc/*/fd walk before rmmod.
6. Readiness probe (/sys/module/tenstorrent/version matches expected) becomes True; controller uncordons and removes the deploy-gate labels; rolling update advances.
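The refcount gate and the readiness check in the sequence above can be sketched in shell roughly as follows. This is an illustration under stated assumptions, not the operator's actual entrypoint: the function names and the overridable SYSFS_MODULE_DIR variable are hypothetical and exist only so the checks can be exercised without real hardware.

```shell
# Hypothetical helpers; names and SYSFS_MODULE_DIR are illustrative only.
SYSFS_MODULE_DIR="${SYSFS_MODULE_DIR:-/sys/module/tenstorrent}"

# Succeeds when the module is not loaded, or is loaded with refcnt == 0.
module_unloadable() {
    [ -d "$SYSFS_MODULE_DIR" ] || return 0       # not loaded: nothing to unload
    refcnt=$(cat "$SYSFS_MODULE_DIR/refcnt" 2>/dev/null) || return 1
    [ "$refcnt" -eq 0 ]
}

# Succeeds when the loaded module reports the expected version string.
module_version_matches() {
    expected="$1"
    [ -r "$SYSFS_MODULE_DIR/version" ] || return 1
    [ "$(cat "$SYSFS_MODULE_DIR/version")" = "$expected" ]
}
```

A readiness probe built on this idea would call something like module_version_matches "$TT_KMD_VERSION" and report Ready only on success.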
Wall-clock per node:
Cache hit (this version was loaded here before): ~5s.
Cache miss, clone + build: ~30–90s on a Wormhole node with no workload.
Downgrade is the same — patch to a lower version. The .ko for the
target version is already cached if it’s been on this node before;
otherwise the builder clones + makes it fresh.
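The cache-hit versus cache-miss decision described above might look like the sketch below. The cache layout (one directory per version containing tenstorrent.ko) and the function name are assumptions for illustration; the builder's real layout is not documented here.

```shell
# Hypothetical cache layout: <cache_dir>/<version>/tenstorrent.ko
# Prints the cached module path on a hit; returns non-zero on a miss.
cached_ko_path() {
    cache_dir="$1"
    version="$2"
    ko="$cache_dir/$version/tenstorrent.ko"
    [ -f "$ko" ] && printf '%s\n' "$ko"
}

# Usage idea (paths are placeholders):
#   if ko=$(cached_ko_path "$CACHE_DIR" "$TT_KMD_VERSION"); then
#       insmod "$ko"            # cache hit: load directly
#   else
#       :                       # cache miss: clone + make modules, then insmod
#   fi
```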
Blocked by workload
The pre-upgrade drain (above) normally drops refcnt to 0 before the
new builder pod runs. If a holder still survives (e.g.
drain.enable=false, or a pod with an operator: Exists toleration that respawns
on the cordoned node), the entrypoint fails loudly with the holder
PIDs and the pod CrashLoops; the CR’s status.summary.failed
increments. To unblock, either:
Set spec.upgradePolicy.forceUnload=true and let the next reconcile SIGKILL the holders (lose in-flight workloads on that node).
Find + drain the holders manually, then delete the CrashLooping pod (kubectl delete pod ttdrv-...) to retry.
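For the manual route, holders can be located with the same kind of /proc/*/fd walk the entrypoint is described as using. The sketch below is an assumption-laden illustration, not the operator's code: find_device_holders is a hypothetical name, and the proc root is a parameter only so the walk can be tried against a fake tree. Run it as root on the node, since other users' fd directories are not readable otherwise.

```shell
# Hypothetical helper: list PIDs that hold an open fd at or under a device path.
find_device_holders() {
    dev_prefix="${1:-/dev/tenstorrent}"
    proc_root="${2:-/proc}"
    for fd in "$proc_root"/[0-9]*/fd/*; do
        [ -L "$fd" ] || continue                 # skip unmatched globs and non-links
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            "$dev_prefix"|"$dev_prefix"/*)
                pid=${fd#"$proc_root"/}          # strip the proc root prefix
                printf '%s\n' "${pid%%/*}"       # keep only the PID component
                ;;
        esac
    done | sort -un                              # one line per PID, ascending
}
```

Each PID can then be mapped back to its pod (for example via /proc/<pid>/cgroup) before deciding what to drain.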
tt-smi
tt-smi version is baked into the builder image at image-build time
(ARG TT_SMI_VERSION). To upgrade across the fleet, bump the chart’s
driver.image.tag to a builder image tag built with the new tt-smi
version.
helm -n tt-k8s-driver-manager-system upgrade tt-k8s-driver-manager \
  oci://ghcr.io/tenstorrent/helm/tt-k8s-driver-manager \
  --reuse-values \
  --set driver.image.tag=sha-newer
(Or set driver.image.repository too if you’re pulling from a mirror.)
Per-node sequence:
1. Controller’s DS template gets the new image tag; template hash changes; rolling update.
2. New pod’s entrypoint writes the binary to a sibling path and rename(2)s it over /host/usr/local/bin/tt-smi, so concurrent tt-smi -s calls on the host see either the old binary or the new one, never a partial file.
3. Patches tt-smi.driver.tenstorrent.com/version on the node.
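The write-to-sibling-then-rename pattern from step 2 can be sketched as below. This is a generic illustration of the technique, not the entrypoint itself; atomic_install and the staging file name are hypothetical. The key point is that the staged copy lives in the same directory as the destination, so mv performs a rename(2) rather than a cross-filesystem copy.

```shell
# Hypothetical helper: install src over dest so readers never see a partial file.
atomic_install() {
    src="$1"
    dest="$2"
    staged="$dest.tmp.$$"              # sibling path: same directory, same filesystem
    cp "$src" "$staged" || { rm -f "$staged"; return 1; }
    chmod 0755 "$staged" || { rm -f "$staged"; return 1; }
    mv -f "$staged" "$dest"            # rename(2) when both paths share a filesystem
}
```

A process that already has the old binary open keeps running it; only new invocations pick up the replacement.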
Firmware
Bump spec.version on the TenstorrentFirmwarePolicy:
kubectl patch ttfwp default --type merge -p '{"spec":{"version":"19.9.0"}}'
Per-node sequence is the state machine described in Firmware Management:
Pending → (Cordoning → Draining)? → Flashing → Uncordoning → Done
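Read literally, the arrow diagram above defines the following transitions, with the Cordoning and Draining phases skipped when drain is disabled. The function below is only a sketch of that reading; next_phase and its drain_enabled argument are hypothetical, and the controller's real logic (including failure states) may differ.

```shell
# Illustrative only: next phase for a node, following the arrow diagram literally.
# drain_enabled is "true" or "false".
next_phase() {
    phase="$1"
    drain_enabled="$2"
    case "$phase" in
        Pending)
            if [ "$drain_enabled" = "true" ]; then echo Cordoning; else echo Flashing; fi ;;
        Cordoning)   echo Draining ;;
        Draining)    echo Flashing ;;
        Flashing)    echo Uncordoning ;;
        Uncordoning) echo Done ;;
        Done)        echo Done ;;        # terminal phase
        *)           return 1 ;;         # unknown phase
    esac
}
```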
Wall-clock per node: ~40s for the Job itself plus drain time if drain is enabled.
To re-flash the SAME version (e.g. recovery from a corrupted flash),
set spec.flasher.forceWrite: true. The controller deletes the
existing Complete Job for that (CR, node, version) and creates a
new one.
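Setting the flag can be done with the same merge-patch style as the version bump. The helper below only builds the patch body; force_write_patch is a hypothetical name, and the CR name default is taken from the version example above.

```shell
# Builds the merge-patch body for spec.flasher.forceWrite (true or false).
force_write_patch() {
    case "$1" in
        true|false) printf '{"spec":{"flasher":{"forceWrite":%s}}}\n' "$1" ;;
        *) echo "usage: force_write_patch true|false" >&2; return 1 ;;
    esac
}

# Usage against a cluster:
#   kubectl patch ttfwp default --type merge -p "$(force_write_patch true)"
```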
The operator itself
Standard Helm upgrade:
helm -n tt-k8s-driver-manager-system upgrade tt-k8s-driver-manager \
  oci://ghcr.io/tenstorrent/helm/tt-k8s-driver-manager \
  --version 0.2.0 # the chart version, not the tt-kmd version
Controller Deployment rolls (1 replica → terminating → new pod up). During the ~5–10s the controller is down:
Existing DS pods keep running with their last-applied template.
The CR is not reconciled — but no in-flight Job is affected.
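Webhooks (if any) are momentarily unavailable; the API server does not queue helm/kubectl writes to the CRs. They fail, or skip the webhook, depending on its failurePolicy, so retry once the new controller pod is Ready.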
If the new operator changes the DS pod template (e.g. a new mount, a new env), the rolling update of every DS kicks in as soon as the new controller leader comes up. That’s a fleet-wide pod churn; nothing on the chip side moves, but expect the DS rollout to run for ~30s × number of nodes.
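As a rough planning aid, the ~30s × number of nodes figure can be turned into a quick estimate. This sketch assumes nodes roll strictly one at a time (maxUnavailable: 1, as in the tt-kmd section) and that each node takes about the same time; rollout_estimate_seconds is a hypothetical helper, and real rollouts will vary with image pulls and probe timing.

```shell
# Rough estimate: nodes updated one at a time, a fixed number of seconds each.
rollout_estimate_seconds() {
    nodes="$1"
    per_node="${2:-30}"                # default taken from the ~30s figure above
    echo $(( nodes * per_node ))
}

# Example: rollout_estimate_seconds 40   prints 1200 (about 20 minutes)
```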
Operator downgrade
helm -n tt-k8s-driver-manager-system rollback tt-k8s-driver-manager
CRs are unaffected. The controller pod restarts with the previous
image. If the previous version doesn’t know about fields you’ve set
in the CR (e.g. you added a spec.foo post-upgrade), that field is
ignored — no data loss, just no enforcement.
CRD upgrades
The CRDs themselves live in the chart’s crds/ directory. Helm’s
convention is that CRDs in crds/ are installed on first install and
NOT modified on upgrade — to prevent data loss from schema changes
that would invalidate existing CRs.
To pick up a CRD schema change (new field, new validation), apply manually:
kubectl apply -f charts/tt-k8s-driver-manager/crds/
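helm upgrade --force is not a substitute: Helm skips the crds/ directory on
every upgrade regardless of flags. Read the chart’s release notes before
applying; a schema change that adds a required field or tightens validation can
cause existing CRs to be rejected on their next update.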
Rolling-update knobs
| Concern | Knob |
|---|---|
| Per-CR parallelism (firmware) | |
| Per-CR parallelism (driver) | DS |
| Per-node timeout (firmware) | |
| Per-node drain timeout | |
| Drain pass 2 on/off (driver) | |
| Drain pass 2 label filter (driver) | |
| SIGKILL device holders (driver) | |
| Sibling-DS deploy gates (driver) | chart |
| Soft pause | |
| Hard pause (per-node) | |