Telemetry

tt-telemetry collects per-device health and exposes it on a Prometheus /metrics endpoint. A collector runs on every Tenstorrent node and reports the state of each device on that node. It is one of the components installed by tt-operator.

Metrics

The collector serves Prometheus metrics including tt_driver_initialized, a gauge that reads 1 once the driver is up on the host. Per-device metrics carry topology identity labels (tray and chip) sourced from the device’s physical system descriptor, so you can attribute a metric to a specific tray and chip.

See the Metrics reference for the full list of exported metrics.

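To check the endpoint directly, forward the collector port and scrape it. kubectl port-forward stays in the foreground, so run the curl from a second terminal: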

kubectl -n tt-operator-system port-forward <telemetry-collector-pod> 8080:8080
curl -s localhost:8080/metrics | grep tt_driver_initialized
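
To narrow the output to a single device, you can filter on the tray and chip labels. The following is a minimal sketch, assuming the labels render in the exposition text as tray="..." and chip="..."; the label value format is an assumption, and metrics_for_device is a hypothetical helper, not part of tt-telemetry.

```shell
# Hypothetical helper: keep only the samples for one tray/chip pair.
# Reads Prometheus text exposition on stdin and drops comment lines.
# Assumes labels render as tray="<value>" and chip="<value>".
# $1 = tray label value, $2 = chip label value.
metrics_for_device() {
  grep -v '^#' | grep -F "tray=\"$1\"" | grep -F "chip=\"$2\""
}

# Usage, with the port-forward still running in another terminal:
#   curl -s localhost:8080/metrics | metrics_for_device 1 3
```

The match is a plain substring test, so it is only suitable for quick inspection; for anything lasting, a label matcher in a Prometheus query is likely the better tool.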

Prometheus integration

tt-telemetry ships a PodMonitor for the Prometheus Operator. If your cluster runs the Prometheus Operator, the collector is discovered and scraped automatically. If it does not (the monitoring.coreos.com resources are absent), set podMonitor.enabled=false, or tt-telemetry.podMonitor.enabled=false when installing through tt-operator, to disable the PodMonitor and avoid an install-time error. You can still scrape the endpoint by other means.
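
One way to choose the setting before installing is to check whether the PodMonitor CRD is registered. This is a sketch rather than a documented procedure: podmonitors.monitoring.coreos.com is the CRD the Prometheus Operator installs, podmonitor_setting is a hypothetical helper, and the release and chart names in the usage comment are placeholders.

```shell
# Hypothetical helper: print the value to pass when the PodMonitor must be
# disabled. Prints nothing when the CRD is present, so the chart's own
# default applies. Any kubectl failure (including an unreachable cluster)
# is treated as "CRD absent".
podmonitor_setting() {
  if ! kubectl get crd podmonitors.monitoring.coreos.com >/dev/null 2>&1; then
    echo "podMonitor.enabled=false"
  fi
}

# Usage (placeholder release and chart names):
#   setting=$(podmonitor_setting)
#   helm install <release> <chart> ${setting:+--set "$setting"}
```

When installing through tt-operator, the value to pass would be tt-telemetry.podMonitor.enabled=false instead.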

Resilience

The collector tolerates device and driver churn. During a tt-kmd reinstall the device briefly disappears and the collector may restart, but /metrics becomes healthy again once the driver is back. This is expected and not an error.
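
If you script a driver reinstall, it can help to wait for the endpoint to recover before continuing. The sketch below polls /metrics until tt_driver_initialized reads 1. Both helper names are hypothetical, and the parsing assumes samples carry no trailing timestamp.

```shell
# Hypothetical helper: succeed if stdin contains a tt_driver_initialized
# sample whose value is 1. Comment lines are skipped because their first
# field is "#". Assumes the value is the last field on the line.
driver_initialized() {
  awk '($1 == "tt_driver_initialized" || index($1, "tt_driver_initialized{") == 1) && $NF + 0 == 1 { ok = 1 }
       END { exit !ok }'
}

# Hypothetical helper: poll until the driver reports ready or attempts run out.
# $1 = metrics URL, $2 = max attempts, $3 = seconds between attempts.
wait_for_driver() {
  i=0
  while [ "$i" -lt "$2" ]; do
    # A failed scrape yields empty input, which counts as "not ready".
    if curl -sf "$1" | driver_initialized; then
      return 0
    fi
    i=$((i + 1))
    sleep "$3"
  done
  return 1
}

# Usage, against any reachable /metrics URL:
#   wait_for_driver http://localhost:8080/metrics 30 5
```

Failed scrapes during the gap are treated the same as a zero reading, which matches the behavior described above: a brief outage is expected, and only a timeout is reported as a failure.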

Topology identity

When config.fabric_manager_address is set, the collector can resolve richer topology from the Fabric Manager. Where no topology is staged, the collector falls back to monitoring all device channels. Metrics remain available in both cases.

Configuration

See the Configuration page for the chart’s full set of values, including the namespace, image, aggregator, and PodMonitor settings.