Metrics
ARC Firmware
| Metric | Description |
|---|---|
| | AI clock frequency in MHz. |
| | Whether the ARC processor heartbeat changed since the last update. |
| | ASIC die temperature in degrees Celsius. |
| | Board (PCB) temperature in degrees Celsius. |
| | Control Module (CM) firmware version. |
| | Device Management application firmware version. |
| | Device Management bootloader firmware version. |
| | Ethernet firmware version. |
| | Fan speed as a percentage of maximum. |
| | Fan speed in RPM. |
| | Firmware bundle version. |
| | Flash firmware version. |
| | Thermal Design Current in amperes. |
| | Thermal Design Power in watts. |
| | Cumulative thermal trip event count. |
| | Core voltage in volts. |
Chip
| Metric | Description |
|---|---|
| | Number of Tenstorrent chips discovered on the host (PCI device enumeration). |
| | Expected chip count on Galaxy hosts (32); emitted only when board type is UBB or UBB_BLACKHOLE. |
| | Whether a given NOC (NOC0 or NOC1) responds to host-issued reads. |
| | Whether the chip’s PCIe link responds to host-issued reads. |
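The relationship between the discovered and expected chip counts can be illustrated with a minimal sketch. The function names and the "missing chips" calculation are hypothetical and are not part of tt-telemetry; only the value 32 and the two board types come from the table above.

```python
# Hypothetical helper names; only the board types and the value 32 come from the table.
GALAXY_BOARD_TYPES = {"UBB", "UBB_BLACKHOLE"}
GALAXY_EXPECTED_CHIPS = 32


def expected_chip_count(board_type):
    """Return 32 for Galaxy board types, or None when no expectation is emitted."""
    return GALAXY_EXPECTED_CHIPS if board_type in GALAXY_BOARD_TYPES else None


def missing_chips(discovered, board_type):
    """Return how many chips are missing, or None if no expected count applies."""
    expected = expected_chip_count(board_type)
    if expected is None:
        return None
    # Clamp at zero so that an unexpected surplus is not reported as negative.
    return max(expected - discovered, 0)
```

For example, a Galaxy host that enumerates 30 chips would yield `missing_chips(30, "UBB") == 2`, while a non-Galaxy board type yields `None` because no expected count is emitted for it.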
DRAM / GDDR
| Metric | Description |
|---|---|
| | Corrected EDC read errors per DRAM module. |
| | Corrected EDC write errors per DRAM module. |
| | DRAM data rate in Mbps. |
| | GDDR bottom die temperature. |
| | GDDR top die temperature. |
| | Whether DRAM training succeeded. |
| | GDDR firmware version. |
Memory
Memory-allocator metrics report DRAM usage from tt-metal’s per-device shared-memory region (/dev/shm/tt_device_<asic_id>_memory). tt-telemetry maps each region read-only and checks the region's layout version against the version tt-telemetry was built with; on a mismatch the metric reports 0 and a one-shot warning is logged. If no tt-metal process has touched a chip on this host, the shared-memory (SHM) file does not yet exist and the metric silently reports 0.
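The read-only mapping and version check can be sketched as follows. This is an illustration only: the actual region layout is defined by tt-metal and is not described here, so the header format (a little-endian 32-bit version at offset 0) and the expected version value are assumptions made purely for the example.

```python
import mmap
import os
import struct

# ASSUMPTIONS for illustration: the real header layout and version value
# are defined by tt-metal and are not documented in this section.
EXPECTED_LAYOUT_VERSION = 1
VERSION_STRUCT = struct.Struct("<I")  # assumed: uint32 at offset 0


def read_layout_version(path):
    """Return the version word from the region, or None if it cannot be read."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        # No process has created the region yet.
        return None
    try:
        if os.fstat(fd).st_size < VERSION_STRUCT.size:
            return None
        # Map the whole file read-only; writes through this mapping are rejected.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as region:
            return VERSION_STRUCT.unpack_from(region, 0)[0]
    finally:
        os.close(fd)


def region_is_usable(path, expected=EXPECTED_LAYOUT_VERSION):
    """True only when the region exists and its version matches the expected one."""
    return read_layout_version(path) == expected
```

In this sketch a missing file and a version mismatch both make `region_is_usable` return `False`, which corresponds to the two cases above in which the metric reports 0.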
Per-chip metrics (tt_device_*) are reported under tray{tray}/chip{chip}/dram/. Host-level aggregates (tt_dram_*) sum across every MMIO-capable chip on the host and are reported under dram/.
For multi-chip mesh devices (e.g. N300, Galaxy), the shared-memory region currently aggregates allocations across the gateway chip and any remote chips reached through it, and reports them all under the gateway’s tray/chip labels. A per-chip breakdown via chip_stats[] is a planned follow-up.
| Metric | Description |
|---|---|
| | Per-chip total DRAM capacity, in mebibytes. |
| | Per-chip DRAM currently allocated by tt-metal across all attached processes, in mebibytes. |
| | Host-wide sum of DRAM capacity across all MMIO-capable chips, in mebibytes. |
| | Host-wide sum of DRAM currently allocated by tt-metal across all chips on this host, in mebibytes. |
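The relationship between the per-chip values and the host-level aggregates can be sketched as a simple sum. The `ChipDram` type and the function names are hypothetical and exist only for this example; the path format follows the tray{tray}/chip{chip}/dram/ convention described in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipDram:
    """Hypothetical per-chip sample; all sizes are in mebibytes."""
    tray: int
    chip: int
    capacity_mib: int
    allocated_mib: int


def chip_path(sample):
    """Build the per-chip reporting path, e.g. 'tray1/chip2/dram/'."""
    return f"tray{sample.tray}/chip{sample.chip}/dram/"


def host_totals(samples):
    """Sum capacity and allocation across every MMIO-capable chip sample."""
    capacity = sum(s.capacity_mib for s in samples)
    allocated = sum(s.allocated_mib for s in samples)
    return capacity, allocated
```

With two chips of 12288 MiB each, of which 1024 MiB and 2048 MiB are allocated, `host_totals` returns `(24576, 3072)`.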
Ethernet
| Metric | Description |
|---|---|
| | Whether a QSFP-DD cable is present on the physical port backing a given ethernet channel. |
| | Corrected codewords (FEC). Wormhole B0 and Blackhole. |
| | CRC errors per channel. Wormhole B0 only. |
| | Raw ETH_CTRL ERR_STAT register value per channel. Blackhole only. |
| | ERISC firmware signature read from the heartbeat word. Identifies whether base or fabric firmware is running on the core. |
| | Ethernet firmware heartbeat status. |
| | Ethernet link up/down status. |
| | Link retraining event count. |
| | Dropped packets per RX queue. Blackhole only. |
| | Packet resends per TX queue. Blackhole only. |
| | Uncorrected codewords (FEC). Wormhole B0 and Blackhole. |
QSFP
QSFP-DD metrics are reported per physical port and are organized by tray and connector rather than by ethernet channel. They require IPMI access on the host and port-type information from a Factory System Descriptor.
| Metric | Description |
|---|---|
| | Whether a QSFP-DD cable is physically connected at a specific tray-level port. |
Host
Host-level metrics report information about the host system and its driver environment. They do not require device initialization.
| Metric | Description |
|---|---|
| | Installed kernel-mode driver (KMD) version. |
Fabric
Fabric metrics are updated by fabric firmware only when workloads are run with fabric telemetry explicitly enabled, for example by setting the environment variable TT_METAL_FABRIC_TELEMETRY to 1.
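One way to set the variable for a single workload, without changing the parent process environment, is to pass a modified copy of the environment to the child process. The sketch below is a minimal illustration; the helper names are hypothetical, and the command passed in stands in for whatever workload is being launched.

```python
import os
import subprocess


def fabric_telemetry_env(base_env=None):
    """Return a copy of the environment with fabric telemetry enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["TT_METAL_FABRIC_TELEMETRY"] = "1"
    return env


def run_with_fabric_telemetry(cmd):
    """Run a command (a stand-in for a workload) with the variable set."""
    return subprocess.run(
        cmd,
        env=fabric_telemetry_env(),
        capture_output=True,
        text=True,
        check=True,
    )
```

Because the environment is copied rather than modified in place, other processes started from the same parent are unaffected.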
| Metric | Description |
|---|---|
| | Fabric routing configuration value. |
| | Fabric device ID within the mesh. |
| | Fabric direction configuration. |
| | Fabric mesh network ID. |
| | Neighbor fabric device ID across the link. |
| | Neighbor fabric mesh ID across the link. |
| | Router state of a specific eRISC core. |
| | RX bandwidth over active cycles (MB/s). |
| | RX bandwidth over all cycles (MB/s). |
| | Total bytes received. |
| | eRISC RX heartbeat counter. |
| | Peak RX bandwidth (MB/s). |
| | Total packets received. |
| | Bitmask of supported fabric statistics. |
| | TX bandwidth over active cycles (MB/s). |
| | TX bandwidth over all cycles (MB/s). |
| | Total bytes transmitted. |
| | eRISC TX heartbeat counter. |
| | Peak TX bandwidth (MB/s). |
| | Total packets transmitted. |
| | Fabric telemetry protocol version. |
Meta
Meta metrics assess the state of the telemetry collection system itself. They are not collected from devices. However, they can surface hardware problems indirectly when they indicate that collection is not occurring.
| Metric | Description |
|---|---|
| | Whether all devices are accessible. |
| | Percentage of readable metrics across all devices. |
| | Whether a specific device is accessible. |
| | Percentage of readable metrics for a specific device. |
| | Whether the device driver is initialized. |
| | Health of the IPMI I2C polling service used for QSFP cable presence. |
| | Whether connected to a collector endpoint. |
| | Timestamp of last completed sampling cycle. |
| | Cumulative warm reset count. |
| | Telemetry collection interval in seconds. |
| | Telemetry server version. |
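As an illustration of how a readable-metrics percentage could be derived, the sketch below computes it per device and across devices. The exact definition used by tt-telemetry is not stated in this section, so the formula is an assumption: it pools counts across devices instead of averaging per-device percentages, and it returns 0 when there are no metrics at all. The function names are hypothetical.

```python
def readable_percentage(readable, total):
    """Percentage of metrics that could be read; assumed to be 0 when total is 0."""
    if total <= 0:
        return 0.0
    return 100.0 * readable / total


def all_devices_readable_percentage(per_device):
    """Pool (readable, total) counts across devices, then take the percentage.

    per_device maps a device label to a (readable, total) tuple. Pooling is an
    assumption; averaging the per-device percentages would weight devices
    with few metrics more heavily.
    """
    readable = sum(r for r, _ in per_device.values())
    total = sum(t for _, t in per_device.values())
    return readable_percentage(readable, total)
```

For example, two devices with 3 of 4 and 1 of 4 readable metrics give per-device values of 75.0 and 25.0, and a pooled value of 50.0.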