Overview

tt-operator is an umbrella Helm chart. Installing it deploys a set of cooperating components that take a node from having Tenstorrent devices installed to running and monitoring Tenstorrent workloads, without hand-installing drivers or wiring up device plugins.

What it installs

| Component | Role |
| --- | --- |
| Node Feature Discovery | Labels nodes that have a Tenstorrent device. |
| Driver Manager | Installs, upgrades, and scopes the tt-kmd driver, and flashes firmware. |
| Telemetry | Exposes device health on a Prometheus endpoint. |
| Fabric Manager | Resolves fabric topology across devices and hosts. |
| DRA Driver | Publishes devices as schedulable resources via Kubernetes Dynamic Resource Allocation. |
| Multi-Node Scheduling | Groups and wires up multi-node jobs. |

Each component can be enabled or disabled independently. See Installation and the Configuration reference.
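As a rough illustration of per-component toggles, the sketch below builds a `helm install` command line from a dictionary of switches. The value keys (`driverManager.enabled` and so on), the release name, and the chart path are all hypothetical placeholders, not the chart's actual names; the real keys are listed in the Configuration reference. Only `helm install` and its `--set` flag are standard Helm.

```python
import shlex

# HYPOTHETICAL value keys, one per component in the table above.
# Check the Configuration reference for the names the chart really uses.
COMPONENT_TOGGLES = {
    "nodeFeatureDiscovery.enabled": True,
    "driverManager.enabled": True,
    "telemetry.enabled": True,
    "fabricManager.enabled": False,
    "draDriver.enabled": True,
    "multiNodeScheduling.enabled": False,
}


def helm_install_args(release, chart, toggles):
    """Build an argument list for `helm install` with one --set per toggle."""
    args = ["helm", "install", release, chart]
    for key, enabled in sorted(toggles.items()):
        # Helm parses the literal strings "true"/"false" as booleans.
        args += ["--set", f"{key}={'true' if enabled else 'false'}"]
    return args


# Release name and chart path are placeholders for illustration only.
args = helm_install_args("tt-operator", "./tt-operator", COMPONENT_TOGGLES)
print(shlex.join(args))
```

Keeping the toggles in one mapping makes it easy to see which components a given install includes. For anything beyond a quick experiment, a values file is usually easier to review than a long chain of `--set` flags.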

How the pieces fit together

    flowchart TD
        NFD[Node Feature Discovery] -->|labels nodes| DM[Driver Manager]
        NFD -->|labels nodes| TEL[Telemetry]
        DM -->|installs tt-kmd, flashes firmware| DEV[(Tenstorrent device)]
        TEL -->|/metrics| PROM[Prometheus]
        FM[Fabric Manager] -->|topology| DRA[DRA Driver]
        FM -->|topology| TEL
        DRA -->|schedulable devices| WL[Workloads]
        DEV --- TEL
        DEV --- DRA

Node Feature Discovery labels the nodes. The Driver Manager brings up tt-kmd and firmware on those nodes. Telemetry reports device health.

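The remaining components extend this baseline. The Fabric Manager resolves topology that the DRA Driver and Telemetry consume, and the DRA Driver makes devices available to workloads. Multi-Node Scheduling, which is not shown in the diagram, groups and wires up multi-node jobs.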
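One way to read the relationships described above is as a small dependency graph. The sketch below transcribes the directed arrows between components and uses Python's standard `graphlib` to print one ordering in which every component comes after the components it consumes from. This is an interpretation for illustration only: it assumes each arrow means "is consumed by", and it does not claim that the chart enforces this order at install time, since each component can be enabled or disabled independently.

```python
from graphlib import TopologicalSorter

# Each key consumes output from the components in its set.
# Transcribed from the directed arrows between components; the device and
# Prometheus nodes are left out because they are not chart components.
CONSUMES_FROM = {
    "Driver Manager": {"Node Feature Discovery"},
    "Telemetry": {"Node Feature Discovery", "Fabric Manager"},
    "DRA Driver": {"Fabric Manager"},
    "Workloads": {"DRA Driver"},
}


def producer_first_order(consumes_from):
    """Return one ordering in which producers precede their consumers."""
    return list(TopologicalSorter(consumes_from).static_order())


order = producer_first_order(CONSUMES_FROM)
print(" -> ".join(order))
```

Several orderings are valid, so the exact output can vary. What holds in every case is that Node Feature Discovery precedes the Driver Manager and Telemetry, and the Fabric Manager precedes the DRA Driver.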

tt-operator manages the driver and firmware lifecycle through declarative policy custom resources, so operations such as upgrades and node scoping are expressed as Kubernetes objects that you apply and observe.