Telemetry
tt-telemetry collects per-device health and exposes it on a Prometheus
/metrics endpoint. A collector runs on every Tenstorrent node and reports the
state of each device on that node. It is one of the components installed by
tt-operator.
Metrics
The collector serves Prometheus metrics including tt_driver_initialized, a
gauge that reads 1 once the driver is up on the host. Per-device metrics carry
topology identity labels (tray and chip) sourced from the device’s physical
system descriptor, so you can attribute a metric to a specific tray and chip.
See the Metrics reference for the full list of exported metrics.
Scrape the endpoint directly to check it:
kubectl -n tt-operator-system port-forward <telemetry-collector-pod> 8080:8080
curl -s localhost:8080/metrics | grep tt_driver_initialized
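A plain grep also matches the # HELP and # TYPE comment lines that Prometheus endpoints normally emit. The sketch below shows one way to check only the sample value, and to narrow the output to a single tray and chip. The helper names are hypothetical. The sketch assumes samples carry no trailing timestamp and that the tray and chip labels appear as tray="..." and chip="..."; the actual label values depend on your hardware.

```shell
# Hypothetical helpers, not part of tt-telemetry.

# Succeed if the metrics text on stdin reports tt_driver_initialized as 1.
# Comment lines (# HELP, # TYPE) never match because they start with '#'.
# Assumes the sample value is the last field on the line (no timestamp).
driver_initialized() {
  awk '/^tt_driver_initialized([{ ]|$)/ && $NF == 1 { found = 1 }
       END { exit !found }'
}

# Print only the samples labelled with the given tray and chip.
# $1 = tray label value, $2 = chip label value (formats are assumptions).
metrics_for_chip() {
  grep -v '^#' | grep "tray=\"$1\"" | grep "chip=\"$2\""
}

# Usage, with the port-forward from above still running:
#   curl -s localhost:8080/metrics | driver_initialized && echo "driver is up"
#   curl -s localhost:8080/metrics | metrics_for_chip 1 2
```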
Prometheus integration
tt-telemetry ships a PodMonitor for the Prometheus Operator. If your cluster
runs the Prometheus Operator, the collector is discovered and scraped
automatically. If it does not, the monitoring.coreos.com resources are absent
and installing the PodMonitor fails. To avoid that install-time error, set
podMonitor.enabled=false (or tt-telemetry.podMonitor.enabled=false when
installing through tt-operator). You can still scrape the endpoint by other
means.
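To decide which setting applies, you can check whether the Prometheus Operator's PodMonitor CRD exists in the cluster. The following is a minimal sketch, not a supported tool: the function name is hypothetical, and it assumes the CRD is registered under the Prometheus Operator's standard name, podmonitors.monitoring.coreos.com.

```shell
# Hypothetical helper: print the chart value to use, based on whether the
# PodMonitor CRD from the Prometheus Operator is present in the cluster.
podmonitor_setting() {
  if kubectl get crd podmonitors.monitoring.coreos.com >/dev/null 2>&1; then
    echo "podMonitor.enabled=true"
  else
    echo "podMonitor.enabled=false"
  fi
}

# Usage (release and chart names are placeholders):
#   helm upgrade --install <release> <chart> --set "$(podmonitor_setting)"
# When installing through tt-operator, prefix the key with "tt-telemetry.".
```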
Resilience
The collector tolerates device and driver churn. During a tt-kmd reinstall the
device briefly disappears and the collector may restart, but /metrics becomes
healthy again once the driver is back. This is expected and not an error.
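If you script a driver reinstall, you may want to wait for the endpoint to recover rather than treat the gap as a failure. The sketch below polls until a check succeeds or a retry budget runs out. Both function names are hypothetical. It assumes a port-forward to localhost:8080 is active, which may need restarting if the collector pod itself restarted.

```shell
# Hypothetical helpers, not part of tt-telemetry.

# Run a command repeatedly until it succeeds.
# $1 = maximum attempts, $2 = seconds between attempts, rest = the command.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Succeed when /metrics answers and tt_driver_initialized reads 1.
# curl -f makes HTTP errors count as failures.
driver_ready() {
  curl -sf localhost:8080/metrics | grep -q '^tt_driver_initialized.* 1$'
}

# Usage: allow up to 30 attempts, 5 seconds apart.
#   wait_until 30 5 driver_ready && echo "collector healthy again"
```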
Topology identity
The collector can resolve richer topology from the Fabric Manager via
config.fabric_manager_address. Where no topology is staged, the collector
falls back to monitoring all device channels. Metrics remain available either
way.
Configuration
See the Configuration page for the chart’s full set of values, including the
namespace, image, aggregator, and PodMonitor settings.