# Upgrades

Four things can be upgraded independently: the tt-kmd kernel module,
tt-smi, device firmware, and the operator itself (along with its CRDs).
The procedure is different for each.

## tt-kmd

Bump `spec.version` on the `TenstorrentDriverPolicy`:

```bash
kubectl patch ttdp default --type merge -p '{"spec":{"version":"2.8.0"}}'
```

Per-node sequence (full state machine in [Upgrade flow](driver.md#upgrade-flow)):

1. Controller cordons the node and flips
   `controller.deployGates` labels (default
   `tenstorrent.com/deploy.tt-telemetry=false`) so sibling DSes that
   hold `/dev/tenstorrent` evict themselves.
2. Two-pass drain: pass 1 evicts hostPath `/dev/tenstorrent` holders;
   pass 2 (`drain.fullNode`, default on) is full `kubectl drain`
   semantics.
3. Controller re-renders the DS template with the new
   `TT_KMD_VERSION` env, stamps a new `template-hash` annotation.
4. K8s rolling update: `maxUnavailable: 1` — one node at a time.
5. New pod's entrypoint checks `refcnt`; if it is 0, it runs
   `rmmod tenstorrent`, takes the `.ko` from cache (or clones and runs
   `make modules`), then `insmod`s it. With
   `upgradePolicy.forceUnload=true`, remaining holders found by a
   `/proc/*/fd` walk are SIGKILLed before `rmmod`.
6. Readiness probe (`/sys/module/tenstorrent/version` matches expected)
   becomes True; controller uncordons and removes the deploy-gate
   labels; rolling update advances.
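
The readiness check in step 6 amounts to a string comparison against sysfs. A minimal sketch for checking a node by hand, assuming the probe compares the file contents exactly; the function name and the optional path argument are illustrative and not part of the operator:

```bash
# Succeeds when the loaded tenstorrent module reports the expected version.
# The optional second argument overrides the sysfs path (useful for dry runs).
kmd_version_matches() {
  local expected="$1"
  local path="${2:-/sys/module/tenstorrent/version}"
  [ -r "$path" ] || return 1          # module not loaded, or no version attribute
  [ "$(cat "$path")" = "$expected" ]  # $(...) strips the trailing newline
}

# Usage on the node:
#   kmd_version_matches 2.8.0 && echo "driver is at 2.8.0"
```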

Wall-clock per node:

- Cache hit (this version was loaded here before): ~5s.
- Cache miss, clone + build: ~30–90s on a Wormhole node with no
  workload.

Downgrade is the same: patch to a lower version. The `.ko` for the
target version is already cached if that version has been loaded on
this node before; otherwise the builder clones and builds it fresh.

### Blocked by workload

The pre-upgrade drain (above) normally drops `refcnt` to 0 before the
new builder pod runs. If a holder survives anyway (e.g.
`drain.enable=false`, or a pod with a blanket `operator: Exists`
toleration that respawns on the cordoned node), the entrypoint exits
with an error listing the holder PIDs and the pod CrashLoops; the CR's
`status.summary.failed` increments. To unblock, either:

- Set `spec.upgradePolicy.forceUnload=true` and let the next reconcile
  SIGKILL the holders (in-flight workloads on that node are lost).
- Find and drain the holders manually, then delete the CrashLooping pod
  (`kubectl delete pod ttdrv-...`) to retry.
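
To find the holders by hand, a `/proc/*/fd` walk can be approximated from a root shell on the node. This is a sketch, not the entrypoint's actual code: the function name and its arguments are illustrative, and it only sees open file descriptors, so a process that keeps a mapping after closing its fd would not show up.

```bash
# Prints the PIDs (one per line) that have an fd open on the device path.
# Arguments are optional: the /proc root and the device path prefix.
find_device_holders() {
  local proc_root="${1:-/proc}" dev_prefix="${2:-/dev/tenstorrent}"
  local fd target pid
  for fd in "$proc_root"/[0-9]*/fd/*; do
    # Each fd entry is a symlink; skip ones that vanished or are unreadable.
    target=$(readlink "$fd" 2>/dev/null) || continue
    case "$target" in
      "$dev_prefix"|"$dev_prefix"/*)
        pid=${fd#"$proc_root"/}   # "<pid>/fd/<n>"
        echo "${pid%%/*}"         # "<pid>"
        ;;
    esac
  done | sort -un                 # one line per PID, numerically sorted
}

# Usage on the node (as root):
#   find_device_holders
```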

## tt-smi

The tt-smi version is baked into the builder image at image-build time
(`ARG TT_SMI_VERSION`). To upgrade across the fleet, bump the chart's
`driver.image.tag` to a builder image tag built with the new tt-smi
version.

```bash
helm -n tt-k8s-driver-manager-system upgrade tt-k8s-driver-manager \
  oci://ghcr.io/tenstorrent/helm/tt-k8s-driver-manager \
  --reuse-values \
  --set driver.image.tag=sha-newer
```

(Or set `driver.image.repository` too if you're pulling from a mirror.)

Per-node sequence:

1. Controller's DS template gets the new image tag; template hash
   changes; rolling update.
2. New pod's entrypoint writes the binary to a sibling path and
   `rename(2)`s it over `/host/usr/local/bin/tt-smi` — concurrent
   `tt-smi -s` calls on the host see either the old binary or the new
   one, never a partial file.
3. Patches `tt-smi.driver.tenstorrent.com/version` on the node.
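
The write-then-rename pattern in step 2 can be sketched in shell as follows. This illustrates the technique rather than the entrypoint's real code; the function name is hypothetical. It relies on `mv` performing a `rename(2)` when source and destination are on the same filesystem, which is why the temporary file is created next to the destination.

```bash
# Replace DEST with a copy of SRC so readers never observe a partial file.
install_atomically() {
  local src="$1" dest="$2" tmp
  # Sibling temp file: same directory, therefore same filesystem as DEST.
  tmp="$(mktemp "${dest}.XXXXXX")" || return 1
  if cp "$src" "$tmp" && chmod 0755 "$tmp" && mv -f "$tmp" "$dest"; then
    return 0
  fi
  rm -f "$tmp"   # clean up the partial copy on any failure
  return 1
}

# Usage (paths from the step above):
#   install_atomically ./tt-smi /host/usr/local/bin/tt-smi
```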

## Firmware

Bump `spec.version` on the `TenstorrentFirmwarePolicy`:

```bash
kubectl patch ttfwp default --type merge -p '{"spec":{"version":"19.9.0"}}'
```

Per-node sequence is the state machine described in [Firmware Management](firmware.md):

`Pending → (Cordoning → Draining)? → Flashing → Uncordoning → Done`

Wall-clock per node: ~40s for the Job itself plus drain time if drain
is enabled.
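
For planning, total firmware rollout time can be approximated from the node count and `spec.upgradePolicy.maxParallel`. The sketch below assumes nodes are flashed in fixed waves of `maxParallel`, which is a simplification (the controller may refill slots as Jobs finish), and it ignores drain time. The function name is hypothetical.

```bash
# Prints an approximate rollout time in seconds.
# per_node defaults to the ~40s Job duration quoted above.
firmware_rollout_seconds() {
  local nodes="$1" max_parallel="$2" per_node="${3:-40}"
  local waves=$(( (nodes + max_parallel - 1) / max_parallel ))  # ceiling division
  echo $(( waves * per_node ))
}

firmware_rollout_seconds 10 4   # 3 waves of ~40s: prints 120
```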

To re-flash the SAME version (e.g. recovery from a corrupted flash),
set `spec.flasher.forceWrite: true`. The controller deletes the
existing Complete Job for that `(CR, node, version)` and creates a
new one.
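
Expressed as a patch in the same style as the version bump, this would look roughly like the following. The wrapper function is illustrative; the field path is the one named above.

```bash
# Ask the controller to re-flash the current version on a firmware policy.
# The CR name defaults to "default", matching the example above.
reflash_same_version() {
  local policy="${1:-default}"
  kubectl patch ttfwp "$policy" --type merge \
    -p '{"spec":{"flasher":{"forceWrite":true}}}'
}

# Usage:
#   reflash_same_version
```

Whether the flag needs to be set back to `false` afterwards is not covered here; reverting it once the Job completes is the cautious choice.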

## The operator itself

Standard Helm upgrade:

```bash
helm -n tt-k8s-driver-manager-system upgrade tt-k8s-driver-manager \
  oci://ghcr.io/tenstorrent/helm/tt-k8s-driver-manager \
  --version 0.2.0   # the chart version, not the tt-kmd version
```

Controller Deployment rolls (1 replica → terminating → new pod up).
During the ~5–10s the controller is down:

- Existing DS pods keep running with their last-applied template.
- The CR is not reconciled — but no in-flight Job is affected.
- Webhooks (if any) are momentarily unavailable; helm/kubectl writes
  to the CRs may fail during that window and need to be retried.

If the new operator changes the DS pod template (e.g. a new mount, a
new env), the rolling update of every DS kicks in as soon as the new
controller leader comes up. That's a fleet-wide pod churn; nothing
on the chip side moves, but expect the DS rollout to run for ~30s ×
number of nodes.

## Operator downgrade

```bash
helm -n tt-k8s-driver-manager-system rollback tt-k8s-driver-manager
```

CRs are unaffected. The controller pod restarts with the previous
image. If the previous version doesn't know about fields you've set
in the CR (e.g. you added a `spec.foo` post-upgrade), that field is
ignored — no data loss, just no enforcement.
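
With no revision argument, `helm rollback` returns to the previous release revision. To target a specific revision, list the history first. A small sketch using standard Helm subcommands; the function names are illustrative:

```bash
NS=tt-k8s-driver-manager-system
RELEASE=tt-k8s-driver-manager

# List revisions so you can pick the one to return to.
operator_history() {
  helm -n "$NS" history "$RELEASE"
}

# Roll back to an explicit revision and wait for resources to become ready.
operator_rollback_to() {
  helm -n "$NS" rollback "$RELEASE" "$1" --wait
}

# Usage:
#   operator_history
#   operator_rollback_to 3
```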

## CRD upgrades

The CRDs themselves live in the chart's `crds/` directory. Helm's
convention is that CRDs in `crds/` are installed on first install and
NOT modified on upgrade — to prevent data loss from schema changes
that would invalidate existing CRs.

To pick up a CRD schema change (new field, new validation), apply
manually:

```bash
kubectl apply -f charts/tt-k8s-driver-manager/crds/
```

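`helm upgrade --force` does not help here: Helm does not update CRDs
from `crds/` on upgrade, with or without the flag. Read the chart's
release notes before applying; a schema change that adds a required
field or tightens validation will reject later writes to existing CRs
that don't satisfy it.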
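
Before applying, it can help to see what would change. A hedged sketch built on `kubectl diff`, which exits 0 when there are no differences, 1 when there are differences, and higher on error; the function name is illustrative:

```bash
# Show the pending CRD changes, then apply only if something differs.
upgrade_crds() {
  local crd_dir="$1" rc=0
  kubectl diff -f "$crd_dir" || rc=$?
  case "$rc" in
    0) echo "CRDs already up to date" ;;
    1) kubectl apply -f "$crd_dir" ;;
    *) echo "kubectl diff failed (exit $rc)" >&2; return "$rc" ;;
  esac
}

# Usage:
#   upgrade_crds charts/tt-k8s-driver-manager/crds/
```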

## Rolling-update knobs

| Concern | Knob |
|---|---|
| Per-CR parallelism (firmware) | `spec.upgradePolicy.maxParallel` |
| Per-CR parallelism (driver) | DS `maxUnavailable: 1` (hardcoded; one-node-at-a-time is the only sensible default for kernel modules) |
| Per-node timeout (firmware) | `spec.upgradePolicy.flashTimeoutSeconds` |
| Per-node drain timeout | `spec.upgradePolicy.drain.timeoutSeconds` |
| Drain pass 2 on/off (driver) | `spec.upgradePolicy.drain.fullNode` |
| Drain pass 2 label filter (driver) | `spec.upgradePolicy.drain.podSelectorLabel` |
| SIGKILL device holders (driver) | `spec.upgradePolicy.forceUnload` |
| Sibling-DS deploy gates (driver) | chart `controller.deployGates` |
| Soft pause | `spec.paused: true` on the CR (both kinds) |
| Hard pause (per-node) | `driver.tenstorrent.com/skip=true` or `firmware.tenstorrent.com/skip=true` label on the node |
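
The two pause knobs from the table, written out as commands for the driver side. The helper names are illustrative, and the CR name `default` follows the earlier examples; the firmware equivalents would use `ttfwp` and the `firmware.tenstorrent.com/skip` label.

```bash
# Soft pause: stop (or resume) reconciling a driver policy.
pause_driver_policy() {
  kubectl patch ttdp "${1:-default}" --type merge -p '{"spec":{"paused":true}}'
}
resume_driver_policy() {
  kubectl patch ttdp "${1:-default}" --type merge -p '{"spec":{"paused":false}}'
}

# Hard pause: exclude a single node from driver upgrades, and undo it.
skip_node_driver() {
  kubectl label node "$1" driver.tenstorrent.com/skip=true --overwrite
}
unskip_node_driver() {
  kubectl label node "$1" driver.tenstorrent.com/skip-   # trailing "-" removes the label
}

# Usage:
#   pause_driver_policy
#   skip_node_driver worker-7
```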