Firmware Management
The firmware controller flashes Tenstorrent device firmware via per-node
Job pods that run tt-flash. State is declared via
TenstorrentFirmwarePolicy (short name: ttfwp).
The minimum CR
apiVersion: firmware.tenstorrent.com/v1alpha1
kind: TenstorrentFirmwarePolicy
metadata:
  name: default
spec:
  version: "19.8.0"
  nodeAffinity: {}
What happens:
1. The controller walks each matched node through a state machine: Pending → (Cordoning → Draining)? → Flashing → Uncordoning → Done.
2. For each node, a Job is created in the operator namespace using the flasher image (ghcr.io/tenstorrent/tt-k8s-driver-manager-flasher). The Job:
   - Reads the pre-flash version via tt-smi -s.
   - Downloads fw_pack-<version>.fwbundle from the github.com/tenstorrent/tt-system-firmware releases.
   - Runs tt-flash --no-color flash <bundle> (with --force if spec.flasher.forceWrite=true).
   - Asserts that the post-flash readback equals spec.readbackVersion (default <version>.0, to match the firmware bundle’s readback format).
3. The Job’s exit code is the controller’s signal; there is no separate readback step in the reconcile loop. A non-zero exit moves the node to Failed, with the Job’s last log lines surfaced in CR status.
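The per-node progression described above can be sketched as a small transition function. This is an illustrative Python model, not the controller's code: the function name and the `drain_enabled` and `job_exit_code` parameters are assumptions, and transient retries and drain timeouts are left out.

```python
from typing import Optional

# States named in the walkthrough above; Failed is the terminal error state.
ORDER = ["Pending", "Cordoning", "Draining", "Flashing", "Uncordoning", "Done"]


def next_state(state: str, drain_enabled: bool,
               job_exit_code: Optional[int] = None) -> str:
    """Return the state that follows `state` (hypothetical helper)."""
    if state in ("Done", "Failed"):
        return state  # terminal states do not advance
    if state == "Pending":
        # Cordoning and Draining are skipped when drain is disabled.
        return "Cordoning" if drain_enabled else "Flashing"
    if state == "Flashing":
        if job_exit_code is None:
            return "Flashing"  # Job still running
        # The Job's exit code is the only signal: 0 succeeds, anything else fails.
        return "Uncordoning" if job_exit_code == 0 else "Failed"
    # Cordoning -> Draining, Draining -> Flashing, Uncordoning -> Done.
    return ORDER[ORDER.index(state) + 1]
```

Calling `next_state("Flashing", True, 1)` yields `Failed`, which matches the rule that a non-zero Job exit fails the node.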
Spec fields
| Field | Default | Purpose |
|---|---|---|
| | required | Firmware bundle version. |
| | | What |
| | github tt-system-firmware release | Pin to a specific URL (mirror, internal repo, signed copy). |
| | required | Same shape as the driver CR. The v1alpha1 alias |
| | | Soft stop. In-flight Jobs not interrupted; new ones don’t start. |
| | | Nodes flashing simultaneously across this CR. Crank up only if a bad fw bundle can’t brick the fleet faster than you can |
| | | Halt the rollout the moment any node hits |
| | | Per-node Job timeout. PCIe-only typically <120s; Galaxy headroom. |
| | | Cordon+drain pods that hold |
| | | Per-node drain timeout. After this, node moves to |
| | | Delete pods that have no controller (bare Pods) instead of evicting. |
| | chart’s | Per-CR override of the flasher image. |
| | | Override for the above. |
| | | Bypass the “current readback already matches target” short-circuit and pass |
| | | Continue with the flash even if |
CR examples
Full-fleet flash
apiVersion: firmware.tenstorrent.com/v1alpha1
kind: TenstorrentFirmwarePolicy
metadata: { name: fleet }
spec:
  version: "19.8.0"
  nodeAffinity: {}
  upgradePolicy:
    maxParallel: 1 # one node at a time; bad fw shouldn't lose the cluster
    drain:
      enable: true
      timeoutSeconds: 600
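To make the effect of maxParallel concrete, here is a rough Python sketch of how a rollout could choose which nodes to start next. It is an assumption about the general technique, not the controller's implementation: `nodes_to_start`, the `halt_on_failure` parameter, and the alphabetical ordering are all hypothetical.

```python
from typing import Dict, List

# States that count against the parallelism budget.
IN_PROGRESS = {"Cordoning", "Draining", "Flashing", "Uncordoning"}


def nodes_to_start(states: Dict[str, str], max_parallel: int,
                   halt_on_failure: bool = False) -> List[str]:
    """Pick Pending nodes to start without exceeding max_parallel."""
    if halt_on_failure and "Failed" in states.values():
        return []  # rollout halted: start nothing new
    busy = sum(1 for s in states.values() if s in IN_PROGRESS)
    free = max(0, max_parallel - busy)
    pending = sorted(n for n, s in states.items() if s == "Pending")
    return pending[:free]
```

With maxParallel: 1 and one node already in Flashing, the function returns an empty list, so a bad bundle can affect at most one node at a time.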
Force re-flash (same version)
spec:
  version: "19.8.0"
  flasher:
    forceWrite: true
forceWrite bypasses the “already at target” short-circuit and passes
--force to tt-flash: the firmware is overwritten and the post-flash readback is asserted again.
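The short-circuit and the --force flag can be pictured with the following Python sketch. The function names are hypothetical, the `--fw-tar` form is taken from the sample Job log later in this document, and where `--force` sits on the command line is an assumption.

```python
from typing import List


def needs_flash(current_readback: str, target_readback: str,
                force_write: bool) -> bool:
    """Skip the flash when the device already reports the target, unless forced."""
    return force_write or current_readback != target_readback


def flash_argv(bundle_path: str, force_write: bool) -> List[str]:
    """Build the tt-flash command line (flag placement is an assumption)."""
    argv = ["tt-flash", "--no-color", "flash", "--fw-tar", bundle_path]
    if force_write:
        argv.append("--force")
    return argv
```

A node that already reads back 19.8.0.0 is skipped when the target is 19.8.0.0 and forceWrite is unset; with forceWrite set, it is flashed again.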
Downgrade
spec:
  version: "19.7.0"
  flasher:
    forceWrite: true # 19.8.0 → 19.7.0 needs --force
  upgradePolicy:
    maxParallel: 1 # downgrade is the riskiest direction; serial
Mirror / pinned bundle
spec:
  version: "19.8.0"
  bundleURL: "https://internal.example.com/fw/fw_pack-19.8.0.fwbundle"
  readbackVersion: "19.8.0.0" # explicit; helps when bundle metadata is odd
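The defaults for bundleURL and readbackVersion can be summarized in a few lines of Python. This is only a sketch: `resolve_bundle` is a hypothetical helper, and for the default case it returns just the bundle file name, because the full release URL is not spelled out in this document.

```python
from typing import Dict, Tuple


def resolve_bundle(spec: Dict[str, str]) -> Tuple[str, str]:
    """Return (bundle source, expected readback) for a policy spec."""
    version = spec["version"]
    # An explicit bundleURL wins; otherwise fall back to the release file name.
    source = spec.get("bundleURL") or f"fw_pack-{version}.fwbundle"
    # readbackVersion defaults to "<version>.0".
    readback = spec.get("readbackVersion") or f"{version}.0"
    return source, readback
```

For `{"version": "19.8.0"}` this gives the file name `fw_pack-19.8.0.fwbundle` and the expected readback `19.8.0.0`.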
Drain semantics
When upgradePolicy.drain.enable: true (default), the per-node state
machine walks: Pending → Cordoning → Draining → Flashing → Uncordoning → Done.
- Cordoning sets node.spec.unschedulable=true plus our annotation firmware.tenstorrent.com/cordoned-by=<crname>. The annotation is load-bearing: we only uncordon nodes WE cordoned, never stomping on an external maintenance window’s cordon.
- Draining identifies “device-using pods” by hostPath mount on /dev/tenstorrent and evicts them via the policy/v1 Eviction subresource (PDB-respecting; 429s on PDB-block are surfaced as transient status with retry on next reconcile). Excludes the operator’s own namespace and DaemonSet-owned pods.
- Flashing is the actual Job described above.
- Uncordoning removes the unschedulable flag + our annotation.
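The pod selection rule for the Draining step can be approximated as below. This Python sketch works on plain pod manifests (dicts) and is an assumption about how such a filter might look: the helper names are hypothetical, and treating paths under /dev/tenstorrent as matches is also an assumption.

```python
from typing import Any, Dict, List

DEVICE_PATH = "/dev/tenstorrent"


def uses_device(pod: Dict[str, Any]) -> bool:
    """True if any volume is a hostPath at or under /dev/tenstorrent."""
    for volume in pod.get("spec", {}).get("volumes", []):
        path = volume.get("hostPath", {}).get("path", "")
        if path == DEVICE_PATH or path.startswith(DEVICE_PATH + "/"):
            return True
    return False


def owned_by_daemonset(pod: Dict[str, Any]) -> bool:
    """True if any ownerReference points at a DaemonSet."""
    owners = pod.get("metadata", {}).get("ownerReferences", [])
    return any(ref.get("kind") == "DaemonSet" for ref in owners)


def pods_to_evict(pods: List[Dict[str, Any]], operator_namespace: str) -> List[str]:
    """Return "namespace/name" for each pod the drain step would evict."""
    return [
        f"{p['metadata']['namespace']}/{p['metadata']['name']}"
        for p in pods
        if uses_device(p)
        and p["metadata"]["namespace"] != operator_namespace
        and not owned_by_daemonset(p)
    ]
```

Pods without a matching hostPath volume are never touched, so ordinary workloads on the node keep running through the flash.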
Drain treadmill caveat
Deployment-managed pods with tolerations: [{operator: Exists}] bypass
cordon (the implicit unschedulable taint is tolerated). Eviction
succeeds; the deployment controller respawns the pod on the same
cordoned node; eviction loops until drain.timeoutSeconds fires. On
timeout the node moves to Failed with the blocking pod list in
status.
Workaround: avoid giving workloads a blanket operator: Exists toleration
unless they need it, or set upgradePolicy.drain.enable: false on this CR
(and accept that flashing may race against in-flight workloads).
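One way to spot treadmill-prone workloads before a rollout is to check whether a pod tolerates the node.kubernetes.io/unschedulable NoSchedule taint that Kubernetes adds to cordoned nodes. The Python sketch below is a hypothetical pre-flight check, not part of the operator. It only looks at Exists tolerations, so an Equal toleration on the same key is not detected.

```python
from typing import Any, Dict

UNSCHEDULABLE_TAINT = "node.kubernetes.io/unschedulable"


def tolerates_cordon(pod: Dict[str, Any]) -> bool:
    """True if an Exists toleration matches the taint added by cordoning."""
    for tol in pod.get("spec", {}).get("tolerations", []):
        if tol.get("operator") != "Exists":
            continue
        # An empty key matches every taint key; an empty effect matches every effect.
        key_ok = tol.get("key", "") in ("", UNSCHEDULABLE_TAINT)
        effect_ok = tol.get("effect", "") in ("", "NoSchedule")
        if key_ok and effect_ok:
            return True
    return False
```

Any pod for which this returns True can be rescheduled onto the cordoned node after eviction, which is the loop described above.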
Skipping drain on a specific node
kubectl label node <name> firmware.tenstorrent.com/skip=true
This opts that one node out of all firmware reconciliation, regardless of
selector. It is separate from the driver-side driver.tenstorrent.com/skip.
Upgrade flow
Same pattern as the driver: patch spec.version. Per-node Jobs roll
through with whatever parallelism + drain config is set:
kubectl patch ttfwp default --type merge -p '{"spec":{"version":"19.9.0"}}'
The controller is idempotent at the Job level: a Job for
(CR, node, version) is created at most once. If the same flash is
re-requested (for example, after you kubectl delete pod a stuck flasher), the
Complete Job is reused as evidence that this node is done.
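Create-at-most-once behaviour is commonly achieved by deriving the Job name deterministically from its key, so that a second create for the same key finds the existing Job. The Python sketch below illustrates that idea only. The name layout follows the sample Job names in this document, but the 7-character suffix here is a made-up hash, since the real suffix derivation is not documented here, and `ensure_job` uses an in-memory dict in place of the Kubernetes API.

```python
import hashlib
from typing import Dict


def job_name(cr: str, node: str, version: str) -> str:
    """Deterministic Job name for (CR, node, version); the suffix is hypothetical."""
    digest = hashlib.sha256(f"{cr}/{node}/{version}".encode()).hexdigest()[:7]
    return f"ttfwp-{cr}-{node}-{version.replace('.', '-')}-{digest}"


def ensure_job(jobs: Dict[str, str], cr: str, node: str, version: str) -> bool:
    """Record the Job only if absent; return True when a new Job was created."""
    name = job_name(cr, node, version)
    if name in jobs:
        return False  # an existing Job (Running or Complete) is reused
    jobs[name] = "Running"
    return True
```

Calling `ensure_job` twice with the same arguments creates one entry and returns False the second time.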
Watch progress
$ kubectl get ttfwp default
NAME VERSION MATCHED UPTODATE INPROGRESS FAILED AGE
default 19.9.0 3 2 1 0 3m
$ kubectl get ttfwp default -o jsonpath='{.status.nodes}' | jq
[
{"name":"node-1","currentVersion":"19.9.0.0","state":"Done"},
{"name":"node-2","currentVersion":"19.9.0.0","state":"Done"},
{"name":"node-3","currentVersion":"19.8.0.0","state":"Flashing",
"lastFlashJob":"ttfwp-default-node-3-19-9-0-abc1234"}
]
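For scripting, the per-node status can be collapsed into the same counters that kubectl get ttfwp prints. This Python sketch assumes the JSON shape shown above, and it assumes that every state other than Done and Failed counts as in progress. The function name and output keys are hypothetical.

```python
import json
from collections import Counter
from typing import Dict


def summarize(status_nodes_json: str) -> Dict[str, int]:
    """Collapse .status.nodes into matched/upToDate/inProgress/failed counts."""
    nodes = json.loads(status_nodes_json)
    states = Counter(n["state"] for n in nodes)
    done, failed = states["Done"], states["Failed"]
    return {
        "matched": len(nodes),
        "upToDate": done,
        "inProgress": len(nodes) - done - failed,
        "failed": failed,
    }
```

Feeding it the three-node output above gives 3 matched, 2 up to date, 1 in progress and 0 failed, in line with the table printed by kubectl get ttfwp default.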
Watch the flasher Job
$ kubectl -n tt-operator-system get jobs -l firmware.tenstorrent.com/cr=default
NAME STATUS COMPLETIONS DURATION
ttfwp-default-node-1-19-9-0-abc1234 Complete 1/1 34s
ttfwp-default-node-2-19-9-0-abc1234 Complete 1/1 36s
ttfwp-default-node-3-19-9-0-abc1234 Running 0/1 18s
$ kubectl -n tt-operator-system logs job/ttfwp-default-node-3-19-9-0-abc1234
[flasher] pre-flash: tt-smi -s
[flasher] pre-flash versions: 19.8.0.0 19.8.0.0 19.8.0.0 ...
[flasher] flash: tt-flash --no-color flash --fw-tar /work/bundle.fwbundle
Stage: DETECT (8 chips)
Stage: FLASH (~30s)
...
Node labels + annotations
| Field | Where | Purpose |
|---|---|---|
| | label | currentVersion after a successful flash |
| | label | which CR is reconciling this node (first-write-wins) |
| | label | per-node state machine position |
| | annotation | readback after last flash |
| | annotation | most recent flash Job name |
| | annotation | the CR that cordoned (so we only uncordon what we cordoned) |
| | annotation | RFC3339 timestamp; drain timeout reference |
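Two of these annotations drive decisions: the cordoning CR gates uncordon, and the RFC3339 timestamp anchors the drain timeout. The Python sketch below shows both checks under stated assumptions. The cordoned-by key comes from the Drain semantics section, the helper names are hypothetical, and the timestamp is passed in as a string rather than read from a named annotation.

```python
from datetime import datetime, timedelta
from typing import Any, Dict

CORDONED_BY = "firmware.tenstorrent.com/cordoned-by"


def may_uncordon(node: Dict[str, Any], cr_name: str) -> bool:
    """Only uncordon a node whose cordon annotation names this CR."""
    annotations = node.get("metadata", {}).get("annotations", {})
    return annotations.get(CORDONED_BY) == cr_name


def drain_timed_out(started_at: str, timeout_seconds: int, now: datetime) -> bool:
    """Compare an RFC3339 drain-start timestamp against the drain timeout."""
    # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
    start = datetime.fromisoformat(started_at.replace("Z", "+00:00"))
    # `now` must be timezone-aware, otherwise the subtraction raises TypeError.
    return now - start > timedelta(seconds=timeout_seconds)
```

A node cordoned by an administrator carries no cordoned-by annotation, so `may_uncordon` returns False and the external cordon is left in place.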
kubectl plugin
kubectl-tt-fw collapses CR + per-node state + last Job into a single
table. Install via make install-plugins.
kubectl tt fw # per-CR table
kubectl tt fw logs <crname> # tail logs from in-flight Jobs