Overview
tt-operator is an umbrella Helm chart. Installing it deploys a set of cooperating components that take a node from having Tenstorrent devices installed to running and monitoring Tenstorrent workloads, without hand-installing drivers or wiring up device plugins.
What it installs
| Component | Role |
|---|---|
| Node Feature Discovery | Labels nodes that have a Tenstorrent device. |
| Driver Manager | Installs, upgrades, and scopes the tt-kmd driver and firmware. |
| Telemetry | Exposes device health on a Prometheus endpoint. |
| Fabric Manager | Resolves fabric topology across devices and hosts. |
| DRA Driver | Publishes devices as schedulable resources via Kubernetes Dynamic Resource Allocation. |
| Multi-Node Scheduling | Groups and wires up multi-node jobs. |
Each component can be enabled or disabled independently. See Installation and the Configuration reference.
How the pieces fit together
flowchart TD
    NFD[Node Feature Discovery] -->|labels nodes| DM[Driver Manager]
    NFD -->|labels nodes| TEL[Telemetry]
    DM -->|installs tt-kmd, flashes firmware| DEV[(Tenstorrent device)]
    TEL -->|/metrics| PROM[Prometheus]
    FM[Fabric Manager] -->|topology| DRA[DRA Driver]
    FM -->|topology| TEL
    DRA -->|schedulable devices| WL[Workloads]
    DEV --- TEL
    DEV --- DRA
Node Feature Discovery labels the nodes. The Driver Manager brings up tt-kmd and firmware on those nodes. Telemetry reports device health.
The remaining components build on that base: the Fabric Manager resolves the topology that the DRA Driver and Telemetry consume, and the DRA Driver makes devices available to workloads.
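To make the relationships in the flowchart easier to query, the sketch below encodes its directed arrows as a plain Python mapping. It is an illustration only, not part of tt-operator: the helper names `upstream_of` and `one_consistent_order` are hypothetical, the undirected device links (`DEV --- TEL`, `DEV --- DRA`) are left out, and the arrows describe data flow, so the computed order is just one ordering consistent with them rather than a documented startup sequence.

```python
from graphlib import TopologicalSorter

# Directed edges copied from the flowchart: source -> components it feeds.
EDGES = {
    "Node Feature Discovery": ["Driver Manager", "Telemetry"],
    "Driver Manager": ["Tenstorrent device"],
    "Fabric Manager": ["DRA Driver", "Telemetry"],
    "Telemetry": ["Prometheus"],
    "DRA Driver": ["Workloads"],
}


def upstream_of(target: str) -> set[str]:
    """Return every node that has a directed path to `target`."""
    found: set[str] = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for source, targets in EDGES.items():
            if current in targets and source not in found:
                found.add(source)
                stack.append(source)
    return found


def one_consistent_order() -> list[str]:
    """Return one ordering in which every source precedes what it feeds."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    predecessors: dict[str, set[str]] = {}
    for source, targets in EDGES.items():
        predecessors.setdefault(source, set())
        for target in targets:
            predecessors.setdefault(target, set()).add(source)
    return list(TopologicalSorter(predecessors).static_order())


if __name__ == "__main__":
    print(sorted(upstream_of("Workloads")))
    print(one_consistent_order())
```

For example, `upstream_of("Workloads")` returns the DRA Driver and the Fabric Manager, which matches the description above: workloads receive devices from the DRA Driver, which in turn consumes topology from the Fabric Manager.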
tt-operator manages the driver and firmware lifecycle through declarative policy custom resources, so operations such as upgrades and node scoping are expressed as Kubernetes objects that you apply and observe.
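The declarative idea can be sketched in a few lines of Python: a policy states a desired driver version and a node scope, and a comparison against observed state yields the nodes that still need work. Everything here is an assumption made for illustration. `DriverPolicy`, `Node`, the label key `example.com/tenstorrent`, and the version strings are hypothetical and do not reflect tt-operator's actual custom resource schema or controller logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DriverPolicy:
    """Hypothetical stand-in for a policy resource (not the real schema)."""
    node_selector: dict[str, str]  # labels a node must carry to be in scope
    driver_version: str            # desired driver version


@dataclass
class Node:
    name: str
    labels: dict[str, str]
    driver_version: Optional[str]  # None means no driver is installed


def in_scope(policy: DriverPolicy, node: Node) -> bool:
    """A node is in scope when it carries every label in the selector."""
    return all(node.labels.get(k) == v for k, v in policy.node_selector.items())


def pending_nodes(policy: DriverPolicy, nodes: list[Node]) -> list[str]:
    """Names of in-scope nodes whose observed version differs from the desired one."""
    return [
        n.name
        for n in nodes
        if in_scope(policy, n) and n.driver_version != policy.driver_version
    ]


# Hypothetical example data.
policy = DriverPolicy({"example.com/tenstorrent": "true"}, "2.0.0")
nodes = [
    Node("node-a", {"example.com/tenstorrent": "true"}, "1.9.0"),
    Node("node-b", {"example.com/tenstorrent": "true"}, "2.0.0"),
    Node("node-c", {}, None),
]

if __name__ == "__main__":
    print(pending_nodes(policy, nodes))
```

The point of the pattern is that an upgrade becomes a change to the desired version and scoping becomes a change to the selector; the system, not the operator, works out which nodes are affected.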