Upgrades

Four components can be upgraded independently (tt-kmd, tt-smi, firmware, and the operator itself); the procedure is different for each.

tt-kmd

Bump spec.version on the TenstorrentDriverPolicy:

kubectl patch ttdp default --type merge -p '{"spec":{"version":"2.8.0"}}'

Per-node sequence (full state machine in Upgrade flow):

  1. Controller cordons the node and flips controller.deployGates labels (default tenstorrent.com/deploy.tt-telemetry=false) so sibling DSes that hold /dev/tenstorrent evict themselves.

  2. Two-pass drain: pass 1 evicts hostPath /dev/tenstorrent holders; pass 2 (drain.fullNode, default on) is full kubectl drain semantics.

  3. Controller re-renders the DS template with the new TT_KMD_VERSION env, stamps a new template-hash annotation.

  4. K8s rolling update: maxUnavailable: 1 — one node at a time.

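  5. New pod’s entrypoint checks the module refcnt. If it is 0, it runs rmmod tenstorrent, takes the .ko from cache or clones the source and builds it (make modules), then runs insmod. With upgradePolicy.forceUnload=true, it first SIGKILLs any remaining holders, found by walking /proc/*/fd, and only then runs rmmod.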

  6. Readiness probe (/sys/module/tenstorrent/version matches expected) becomes True; controller uncordons and removes the deploy-gate labels; rolling update advances.
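The refcnt check in step 5 and the version check in step 6 can be reproduced by hand on a node. This is a minimal sketch, not the entrypoint's actual code: the function names and the SYSFS_ROOT override are hypothetical, and it assumes the version file holds exactly the expected version string.

```shell
# Hypothetical helpers. SYSFS_ROOT can be overridden so the logic can be
# tried against a fake tree instead of a live node.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Succeeds when the module is not loaded or its reference count is 0.
module_unloadable() (
  refcnt_file="$SYSFS_ROOT/module/$1/refcnt"
  [ -r "$refcnt_file" ] || exit 0   # module not loaded: nothing to unload
  [ "$(cat "$refcnt_file")" -eq 0 ]
)

# Succeeds when the loaded module reports the expected version.
module_version_matches() (
  version_file="$SYSFS_ROOT/module/$1/version"
  [ -r "$version_file" ] && [ "$(cat "$version_file")" = "$2" ]
)
```

For example, `module_unloadable tenstorrent && echo "safe to rmmod"` before the patch, and `module_version_matches tenstorrent 2.8.0` afterwards.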

Wall-clock per node:

  • Cache hit (this version was loaded here before): ~5s.

  • Cache miss, clone + build: ~30–90s on a Wormhole node with no workload.
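Because nodes are upgraded one at a time, a rough fleet-wide figure is the node count times the per-node time. The helper below is a hypothetical back-of-the-envelope sketch; it assumes the per-node figures above and leaves out drain and scheduling time, which they may not cover.

```shell
# Rough lower bound on fleet rollout time with maxUnavailable: 1.
# Args: node count, per-node seconds. Drain time is not included.
estimate_rollout_seconds() {
  echo $(( $1 * $2 ))
}

# Example: 20 nodes, worst-case cache miss of 90s each.
estimate_rollout_seconds 20 90   # prints 1800
```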

Downgrade is the same — patch to a lower version. The .ko for the target version is already cached if it’s been on this node before; otherwise the builder clones + makes it fresh.

Blocked by workload

The pre-upgrade drain (above) normally drops refcnt to 0 before the new builder pod runs. If a holder still survives — e.g. drain.enable=false, or a pod with tolerations: Exists that respawns on the cordoned node — the entrypoint fails loudly with the holder PIDs and the pod CrashLoops; the CR’s status.summary.failed increments. To unblock, either:

  • Set spec.upgradePolicy.forceUnload=true and let the next reconcile SIGKILL the holders (lose in-flight workloads on that node).

  • Find + drain the holders manually, then delete the CrashLooping pod (kubectl delete pod ttdrv-...) to retry.
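To find the holders manually, one option is the same kind of /proc/*/fd walk that forceUnload uses. This is a sketch only, not the operator's implementation: the function name is hypothetical, and it assumes holders show up as open file descriptors pointing at or under /dev/tenstorrent. Run it as root on the node, since other users' fd directories are unreadable otherwise.

```shell
# Print the PIDs of processes holding an open fd at or under a device path.
# Args: device path (default /dev/tenstorrent), proc root (default /proc).
# Entries that cannot be read are skipped silently.
find_device_holders() (
  dev="${1:-/dev/tenstorrent}"
  proc_root="${2:-/proc}"
  for fd in "$proc_root"/[0-9]*/fd/*; do
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
      "$dev"|"$dev"/*)
        pid=${fd#"$proc_root"/}   # strip the proc root, leaving PID/fd/N
        echo "${pid%%/*}"         # keep only the PID
        ;;
    esac
  done | sort -un
)
```

Running `find_device_holders` with no arguments lists one PID per line; map each PID back to its pod before deciding whether to drain or kill it.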

tt-smi

tt-smi version is baked into the builder image at image-build time (ARG TT_SMI_VERSION). To upgrade across the fleet, bump the chart’s driver.image.tag to a builder image tag built with the new tt-smi version.

helm -n tt-k8s-driver-manager-system upgrade tt-k8s-driver-manager \
  oci://ghcr.io/tenstorrent/helm/tt-k8s-driver-manager \
  --reuse-values \
  --set driver.image.tag=sha-newer

(Or set driver.image.repository too if you’re pulling from a mirror.)

Per-node sequence:

  1. Controller’s DS template gets the new image tag; template hash changes; rolling update.

  2. New pod’s entrypoint writes the binary to a sibling path and rename(2)s it over /host/usr/local/bin/tt-smi — concurrent tt-smi -s calls on the host see either the old binary or the new one, never a partial file.

  3. Patches tt-smi.driver.tenstorrent.com/version on the node.
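The write-then-rename pattern in step 2 looks roughly like this in shell. It is a sketch under assumptions, not the entrypoint's code: the function name and the temporary-file naming are hypothetical. Staging the file in the destination's own directory keeps the mv on one filesystem, where it is a rename(2).

```shell
# Replace a file atomically: stage a copy next to the destination, then rename.
# Args: source file, destination path.
install_atomically() (
  src=$1
  dest=$2
  tmp="$dest.tmp.$$"   # sibling path, so mv stays on the same filesystem
  cp "$src" "$tmp" && chmod 0755 "$tmp" && mv -f "$tmp" "$dest" || {
    rm -f "$tmp"       # do not leave a half-written sibling behind
    exit 1
  }
)
```

A reader that opened the old binary before the rename keeps running the old inode; anything that opens the path afterwards gets the new one.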

Firmware

Bump spec.version on the TenstorrentFirmwarePolicy:

kubectl patch ttfwp default --type merge -p '{"spec":{"version":"19.9.0"}}'

Per-node sequence is the state machine described in Firmware Management:

Pending → (Cordoning → Draining)? → Flashing → Uncordoning → Done
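The phase sequence above can be written as a small transition function. This is an illustrative sketch, not the controller's code: the function name is hypothetical, it assumes the optional cordon-and-drain pair is skipped when drain is disabled, and it leaves out any failure phases.

```shell
# Next phase of the per-node firmware state machine (happy path only).
# Args: current phase, whether drain is enabled ("true" or "false").
next_phase() {
  case "$1" in
    Pending)     if [ "$2" = "true" ]; then echo Cordoning; else echo Flashing; fi ;;
    Cordoning)   echo Draining ;;
    Draining)    echo Flashing ;;
    Flashing)    echo Uncordoning ;;
    Uncordoning) echo Done ;;
    Done)        echo Done ;;   # terminal phase
    *)           echo "unknown phase: $1" >&2; return 1 ;;
  esac
}
```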

Wall-clock per node: ~40s for the Job itself plus drain time if drain is enabled.

To re-flash the SAME version (e.g. recovery from a corrupted flash), set spec.flasher.forceWrite: true. The controller deletes the existing Complete Job for that (CR, node, version) and creates a new one.

The operator itself

Standard Helm upgrade:

helm -n tt-k8s-driver-manager-system upgrade tt-k8s-driver-manager \
  oci://ghcr.io/tenstorrent/helm/tt-k8s-driver-manager \
  --version 0.2.0   # the chart version, not the tt-kmd version

Controller Deployment rolls (1 replica → terminating → new pod up). During the ~5–10s the controller is down:

  • Existing DS pods keep running with their last-applied template.

  • The CR is not reconciled — but no in-flight Job is affected.

  • Webhooks (if any) are momentarily unavailable; helm/kubectl operations on the CRDs queue.

If the new operator changes the DS pod template (e.g. a new mount, a new env), the rolling update of every DS kicks in as soon as the new controller leader comes up. That’s a fleet-wide pod churn; nothing on the chip side moves, but expect the DS rollout to run for ~30s × number of nodes.

Operator downgrade

helm -n tt-k8s-driver-manager-system rollback tt-k8s-driver-manager

CRs are unaffected. The controller pod restarts with the previous image. If the previous version doesn’t know about fields you’ve set in the CR (e.g. you added a spec.foo post-upgrade), that field is ignored — no data loss, just no enforcement.

CRD upgrades

The CRDs themselves live in the chart’s crds/ directory. Helm’s convention is that CRDs in crds/ are installed on first install and NOT modified on upgrade — to prevent data loss from schema changes that would invalidate existing CRs.

To pick up a CRD schema change (new field, new validation), apply manually:

kubectl apply -f charts/tt-k8s-driver-manager/crds/

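helm upgrade --force does not help here: Helm skips the crds/ directory on every upgrade, regardless of flags. Read the chart’s release notes before applying a new schema; a change that tightens validation (for example, a newly required field) can cause existing CRs to be rejected on their next update.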
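Before applying a new schema, it can help to preview what would change. The wrapper below is a minimal sketch: the function name and the KUBECTL override are hypothetical, while kubectl diff and its exit codes (0 for no differences, 1 for differences, greater than 1 on error) are standard kubectl behaviour.

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Show pending CRD changes, then apply them only if there are any.
# Args: directory holding the CRD manifests.
apply_crds_if_changed() {
  crd_dir=$1
  rc=0
  $KUBECTL diff -f "$crd_dir" || rc=$?
  case "$rc" in
    0) echo "CRDs already up to date" ;;
    1) $KUBECTL apply -f "$crd_dir" ;;
    *) echo "kubectl diff failed (exit $rc)" >&2; return "$rc" ;;
  esac
}
```

Usage: `apply_crds_if_changed charts/tt-k8s-driver-manager/crds/`. Review the diff output before relying on this in automation.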

Rolling-update knobs

| Concern | Knob |
|---|---|
| Per-CR parallelism (firmware) | spec.upgradePolicy.maxParallel |
| Per-CR parallelism (driver) | DS maxUnavailable: 1 (hardcoded; one-node-at-a-time is the only sensible default for kernel modules) |
| Per-node timeout (firmware) | spec.upgradePolicy.flashTimeoutSeconds |
| Per-node drain timeout | spec.upgradePolicy.drain.timeoutSeconds |
| Drain pass 2 on/off (driver) | spec.upgradePolicy.drain.fullNode |
| Drain pass 2 label filter (driver) | spec.upgradePolicy.drain.podSelectorLabel |
| SIGKILL device holders (driver) | spec.upgradePolicy.forceUnload |
| Sibling-DS deploy gates (driver) | chart controller.deployGates |
| Soft pause | spec.paused: true on the CR (both kinds) |
| Hard pause (per-node) | driver.tenstorrent.com/skip=true or firmware.tenstorrent.com/skip=true label on the node |
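The two pause knobs can be driven from the command line. The helpers below are a sketch: the function names and the KUBECTL override are hypothetical, and they assume the CRs are addressed the same way as in the patch commands earlier on this page (ttdp and ttfwp short names, no namespace flag).

```shell
KUBECTL="${KUBECTL:-kubectl}"

# Soft pause or resume a policy CR.
# Args: resource (ttdp or ttfwp), CR name, true|false.
set_policy_paused() {
  $KUBECTL patch "$1" "$2" --type merge -p "{\"spec\":{\"paused\":$3}}"
}

# Hard pause one node. Args: node name, component (driver or firmware).
skip_node() {
  $KUBECTL label node "$1" "$2.tenstorrent.com/skip=true" --overwrite
}

# Undo the hard pause; a trailing "-" on the key removes the label.
unskip_node() {
  $KUBECTL label node "$1" "$2.tenstorrent.com/skip-"
}
```

For example, `set_policy_paused ttdp default true` pauses the driver policy, and `skip_node node-1 firmware` excludes a single node from firmware flashing.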