Metrics

ARC Firmware

| Metric | Description |
| --- | --- |
| tt_ai_clock_mhz | AI clock frequency in MHz. |
| tt_arc_heartbeat | Whether the ARC processor heartbeat changed since the last update. |
| tt_asic_temperature_celsius | ASIC die temperature in degrees Celsius. |
| tt_board_temperature_celsius | Board (PCB) temperature in degrees Celsius. |
| tt_cm_firmware_info | Control Module (CM) firmware version. |
| tt_dm_app_firmware_info | Device Management application firmware version. |
| tt_dm_bootloader_firmware_info | Device Management bootloader firmware version. |
| tt_ethernet_firmware_info | Ethernet firmware version. |
| tt_fan_speed_percent | Fan speed as a percentage of maximum. |
| tt_fan_speed_rpm | Fan speed in RPM. |
| tt_firmware_bundle_info | Firmware bundle version. |
| tt_flash_info | Flash firmware version. |
| tt_tdc_a | Thermal Design Current in amperes. |
| tt_tdp_w | Thermal Design Power in watts. |
| tt_thermal_trip_count | Cumulative thermal trip event count. |
| tt_vcore_v | Core voltage in volts. |
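tt_arc_heartbeat reports a change between updates rather than a raw value. The sketch below illustrates that idea only; it assumes the collector reads some advancing counter on each update, and the class name and sample values are hypothetical rather than taken from tt-telemetry.

```python
class HeartbeatMonitor:
    """Tracks whether a raw heartbeat counter changed between updates."""

    def __init__(self):
        self._last = None  # no sample has been seen yet

    def update(self, raw_counter: int) -> bool:
        """Return True if the counter differs from the previous sample."""
        # Design choice in this sketch: the very first sample has nothing to
        # compare against, so it is reported as "not changed".
        changed = self._last is not None and raw_counter != self._last
        self._last = raw_counter
        return changed


monitor = HeartbeatMonitor()
samples = [100, 101, 101, 105]  # hypothetical raw heartbeat reads
results = [monitor.update(s) for s in samples]
print(results)
```

A value that stays False across several consecutive updates would suggest the processor has stopped advancing its heartbeat.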

Chip

| Metric | Description |
| --- | --- |
| tt_chip_count | Number of Tenstorrent chips discovered on the host (PCI device enumeration). |
| tt_expected_chip_count | Expected chip count on Galaxy hosts (32); emitted only when board type is UBB or UBB_BLACKHOLE. |
| tt_noc_alive | Whether a given NOC (NOC0 or NOC1) responds to host-issued reads. |
| tt_pcie_link_alive | Whether the chip’s PCIe link responds to host-issued reads. |
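Comparing tt_chip_count with tt_expected_chip_count is a quick way to spot chips that failed to enumerate on a Galaxy host. The following is a minimal sketch of that comparison; the function names are hypothetical, and the board type string "N300" in the usage lines is an assumed example of a non-Galaxy board.

```python
# Board types for which an expected chip count is emitted, per the list above.
GALAXY_BOARD_TYPES = {"UBB", "UBB_BLACKHOLE"}
GALAXY_EXPECTED_CHIPS = 32


def expected_chip_count(board_type):
    """Return 32 on Galaxy boards, or None when no expectation is emitted."""
    if board_type in GALAXY_BOARD_TYPES:
        return GALAXY_EXPECTED_CHIPS
    return None


def missing_chips(board_type, chip_count):
    """Return how many chips are missing, or None if there is no expectation."""
    expected = expected_chip_count(board_type)
    if expected is None:
        return None
    return max(expected - chip_count, 0)


print(missing_chips("UBB", 30))   # Galaxy host with two chips not enumerated
print(missing_chips("N300", 2))   # no expected count on this board type
```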

DRAM / GDDR

| Metric | Description |
| --- | --- |
| tt_dram_corrected_edc_read_errors | Corrected EDC read errors per DRAM module. |
| tt_dram_corrected_edc_write_errors | Corrected EDC write errors per DRAM module. |
| tt_dram_speed_mbps | DRAM data rate in Mbps. |
| tt_dram_temperature_bottom_celsius | GDDR bottom die temperature. |
| tt_dram_temperature_top_celsius | GDDR top die temperature. |
| tt_dram_trained | Whether DRAM training succeeded. |
| tt_gddr_firmware_info | GDDR firmware version. |

Memory

Memory-allocator metrics report DRAM usage from tt-metal’s per-device shared-memory (SHM) region (/dev/shm/tt_device_<asic_id>_memory). tt-telemetry maps each region read-only and checks the region’s layout version against the version tt-telemetry was built with; on a mismatch the metric reports 0 and a one-shot warning is logged. If no tt-metal process has touched a chip on this host, the SHM file does not yet exist and the metric silently reports 0.
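The read path described above (read-only mapping, version check, 0 on mismatch or missing file) can be sketched as follows. This is an illustration under stated assumptions, not tt-telemetry's implementation: the header layout, the version number, and the function name are all hypothetical, since the real region layout is defined by tt-metal and is not described here. The demonstration uses temporary files instead of /dev/shm.

```python
import logging
import mmap
import struct
import tempfile
from pathlib import Path

# ASSUMPTION: illustrative layout only, not the real tt-metal layout.
HEADER = struct.Struct("<IQ")   # uint32 layout version, uint64 used bytes
EXPECTED_LAYOUT_VERSION = 1     # hypothetical version this reader was built for

log = logging.getLogger(__name__)
_warned_paths = set()           # regions we have already warned about


def read_used_bytes(path):
    """Return used bytes from a region, or 0 if it is missing or mismatched."""
    path = Path(path)
    # No file yet (or one too short to hold the header): silently report 0.
    if not path.exists() or path.stat().st_size < HEADER.size:
        return 0
    with path.open("rb") as f:
        # ACCESS_READ maps the file read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as region:
            version, used_bytes = HEADER.unpack_from(region, 0)
    if version != EXPECTED_LAYOUT_VERSION:
        if path not in _warned_paths:   # warn only once per region
            _warned_paths.add(path)
            log.warning("%s: layout version %d, expected %d",
                        path, version, EXPECTED_LAYOUT_VERSION)
        return 0
    return used_bytes


with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "tt_device_0_memory"
    good.write_bytes(HEADER.pack(EXPECTED_LAYOUT_VERSION, 4096))
    stale = Path(tmp) / "tt_device_1_memory"
    stale.write_bytes(HEADER.pack(EXPECTED_LAYOUT_VERSION + 1, 4096))
    missing = Path(tmp) / "tt_device_2_memory"   # never created

    used_good = read_used_bytes(good)
    used_stale = read_used_bytes(stale)
    used_missing = read_used_bytes(missing)

print(used_good, used_stale, used_missing)
```

Because a missing file and a version mismatch both yield 0, a reading of 0 does not by itself distinguish "nothing allocated" from "nothing readable"; the logged warning is the only signal for the mismatch case.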

Per-chip metrics (tt_device_*) are reported under tray{tray}/chip{chip}/dram/. Host-level aggregates (tt_dram_*) sum across every MMIO-capable chip on the host and are reported under dram/.

For multi-chip mesh devices (e.g. N300, Galaxy) the shared-memory region currently aggregates allocations across the gateway chip and any remote chips reached through it, all reported under the gateway’s tray/chip labels. Per-chip breakdown via chip_stats[] is a planned follow-up.

| Metric | Description |
| --- | --- |
| tt_device_dram_total_megabytes | Per-chip total DRAM capacity, in mebibytes. |
| tt_device_dram_used_megabytes | Per-chip DRAM currently allocated by tt-metal across all attached processes, in mebibytes. |
| tt_dram_total_megabytes | Host-wide sum of DRAM capacity across all MMIO-capable chips, in mebibytes. |
| tt_dram_used_megabytes | Host-wide sum of DRAM currently allocated by tt-metal across all chips on this host, in mebibytes. |
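The relationship between the per-chip and host-level memory metrics can be sketched as a simple sum followed by a conversion to mebibytes. The ChipDram type, the byte counts, and the tray and chip numbers below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

MIB = 1024 * 1024  # bytes per mebibyte


@dataclass(frozen=True)
class ChipDram:
    """Hypothetical per-chip reading for one MMIO-capable chip."""
    tray: int
    chip: int
    total_bytes: int
    used_bytes: int

    @property
    def path(self):
        # Per-chip metrics are reported under tray{tray}/chip{chip}/dram/.
        return f"tray{self.tray}/chip{self.chip}/dram/"


def host_aggregates(chips):
    """Sum per-chip values into the host-level metrics, in mebibytes."""
    return {
        "tt_dram_total_megabytes": sum(c.total_bytes for c in chips) / MIB,
        "tt_dram_used_megabytes": sum(c.used_bytes for c in chips) / MIB,
    }


chips = [
    ChipDram(tray=1, chip=1, total_bytes=12 * 1024 * MIB, used_bytes=512 * MIB),
    ChipDram(tray=1, chip=2, total_bytes=12 * 1024 * MIB, used_bytes=1536 * MIB),
]
totals = host_aggregates(chips)
print(chips[0].path, totals)
```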

Ethernet

| Metric | Description |
| --- | --- |
| tt_ethernet_cable_present | Whether a QSFP-DD cable is present on the physical port backing a given ethernet channel. |
| tt_ethernet_corrected_codeword_count | Corrected codewords (FEC). Wormhole B0 and Blackhole. |
| tt_ethernet_crc_error_count | CRC errors per channel. Wormhole B0 only. |
| tt_ethernet_error_status | Raw ETH_CTRL ERR_STAT register value per channel. Blackhole only. |
| tt_eth_firmware_signature | ERISC firmware signature read from the heartbeat word. Identifies whether base or fabric firmware is running on the core. |
| tt_ethernet_heartbeat | Ethernet firmware heartbeat status. |
| tt_ethernet_link_up | Ethernet link up/down status. |
| tt_ethernet_retrain_count | Link retraining event count. |
| tt_ethernet_rxq_packet_drop_count | Dropped packets per RX queue. Blackhole only. |
| tt_ethernet_txq_resend_count | Packet resends per TX queue. Blackhole only. |
| tt_ethernet_uncorrected_codeword_count | Uncorrected codewords (FEC). Wormhole B0 and Blackhole. |
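Several of the ethernet metrics above are limited to particular architectures. When building dashboards or alerts it can help to encode those notes once, as in the sketch below. The architecture strings "wormhole_b0" and "blackhole" and the function names are hypothetical labels, and metrics without an architecture note are assumed here to be unrestricted.

```python
# Architecture notes transcribed from the descriptions above.
ARCH_RESTRICTED = {
    "tt_ethernet_corrected_codeword_count": {"wormhole_b0", "blackhole"},
    "tt_ethernet_uncorrected_codeword_count": {"wormhole_b0", "blackhole"},
    "tt_ethernet_crc_error_count": {"wormhole_b0"},
    "tt_ethernet_error_status": {"blackhole"},
    "tt_ethernet_rxq_packet_drop_count": {"blackhole"},
    "tt_ethernet_txq_resend_count": {"blackhole"},
}


def is_expected(metric, arch):
    """Return True if the metric is expected on the given architecture."""
    # ASSUMPTION: metrics with no architecture note apply everywhere.
    allowed = ARCH_RESTRICTED.get(metric)
    return allowed is None or arch in allowed


def restricted_metrics_for(arch):
    """List the architecture-restricted metrics that apply to arch."""
    return sorted(m for m, archs in ARCH_RESTRICTED.items() if arch in archs)


print(restricted_metrics_for("wormhole_b0"))
print(is_expected("tt_ethernet_crc_error_count", "blackhole"))
```

A check like this avoids alerting on a metric that is simply absent because the architecture does not provide it.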

QSFP

Per-physical-port QSFP-DD metrics, organized by tray and connector rather than by ethernet channel. Requires IPMI access on the host and port-type information from a Factory System Descriptor.

| Metric | Description |
| --- | --- |
| tt_cable_present | Whether a QSFP-DD cable is physically connected at a specific tray-level port. |

Host

Host-level metrics report information about the host system and its driver environment. They do not require device initialization.

| Metric | Description |
| --- | --- |
| tt_kmd_info | Installed kernel-mode driver (KMD) version. |

Fabric

Fabric metrics are updated by fabric firmware only when workloads run with fabric telemetry explicitly enabled, for example by setting the environment variable TT_METAL_FABRIC_TELEMETRY to 1.
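One way to set the variable for a single workload without changing the parent environment is to pass a modified copy of the environment to the child process. In this sketch the child is a stand-in Python one-liner that echoes the variable; a real workload command would replace it, and the helper name is hypothetical.

```python
import os
import subprocess
import sys


def fabric_telemetry_env(base=None):
    """Return a copy of the environment with fabric telemetry enabled."""
    env = dict(os.environ if base is None else base)
    env["TT_METAL_FABRIC_TELEMETRY"] = "1"
    return env


# Stand-in for a real workload: a child process that prints the variable.
child = [sys.executable, "-c",
         "import os; print(os.environ['TT_METAL_FABRIC_TELEMETRY'])"]
result = subprocess.run(child, env=fabric_telemetry_env(),
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```

Exporting the variable in the shell before launching the workload achieves the same effect; the per-process form simply keeps the setting scoped to one run.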

| Metric | Description |
| --- | --- |
| tt_fabric_config | Fabric routing configuration value. |
| tt_fabric_device_id | Fabric device ID within the mesh. |
| tt_fabric_direction | Fabric direction configuration. |
| tt_fabric_mesh_id | Fabric mesh network ID. |
| tt_fabric_neighbor_device_id | Neighbor fabric device ID across the link. |
| tt_fabric_neighbor_mesh_id | Neighbor fabric mesh ID across the link. |
| tt_fabric_router_state | Router state of a specific eRISC core. |
| tt_fabric_rx_active_bandwidth_megabytes_per_second | RX bandwidth over active cycles (MB/s). |
| tt_fabric_rx_bandwidth_megabytes_per_second | RX bandwidth over all cycles (MB/s). |
| tt_fabric_rx_bytes_total | Total bytes received. |
| tt_fabric_rx_heartbeat_total | eRISC RX heartbeat counter. |
| tt_fabric_rx_max_bandwidth_megabytes_per_second | Peak RX bandwidth (MB/s). |
| tt_fabric_rx_packets_total | Total packets received. |
| tt_fabric_supported_stats | Bitmask of supported fabric statistics. |
| tt_fabric_tx_active_bandwidth_megabytes_per_second | TX bandwidth over active cycles (MB/s). |
| tt_fabric_tx_bandwidth_megabytes_per_second | TX bandwidth over all cycles (MB/s). |
| tt_fabric_tx_bytes_total | Total bytes transmitted. |
| tt_fabric_tx_heartbeat_total | eRISC TX heartbeat counter. |
| tt_fabric_tx_max_bandwidth_megabytes_per_second | Peak TX bandwidth (MB/s). |
| tt_fabric_tx_packets_total | Total packets transmitted. |
| tt_fabric_version | Fabric telemetry protocol version. |
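The cumulative byte counters (tt_fabric_rx_bytes_total, tt_fabric_tx_bytes_total) can also be turned into an average throughput on the consumer side by differencing two samples. The sketch below shows that standard counter-rate calculation; it is not how the firmware derives the bandwidth metrics above, and it assumes 1 MB = 1,000,000 bytes and that a decrease in the counter means it was reset.

```python
def rate_megabytes_per_second(prev_bytes, curr_bytes, elapsed_seconds):
    """Average throughput between two samples of a cumulative byte counter."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = curr_bytes - prev_bytes
    if delta < 0:
        # ASSUMPTION: the counter restarted from zero, so the current value
        # is the number of bytes counted since the reset.
        delta = curr_bytes
    return delta / elapsed_seconds / 1_000_000


# Hypothetical samples taken 5 seconds apart.
steady = rate_megabytes_per_second(1_000_000_000, 3_500_000_000, 5.0)
# Hypothetical samples straddling a counter reset, 1 second apart.
after_reset = rate_megabytes_per_second(9_000_000_000, 1_000_000, 1.0)
print(steady, after_reset)
```

This yields an average over the sampling window, so short bursts visible in the peak bandwidth metrics are smoothed out.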

Meta

Meta metrics assess the state of the telemetry collection system itself. They are not collected from devices. However, they can surface hardware problems indirectly when they indicate that collection is not occurring.

| Metric | Description |
| --- | --- |
| tt_all_devices_accessible | Whether all devices are accessible. |
| tt_all_devices_metrics_readable_percent | Percentage of readable metrics across all devices. |
| tt_device_accessible | Whether a specific device is accessible. |
| tt_device_metrics_readable_percent | Percentage of readable metrics for a specific device. |
| tt_driver_initialized | Whether the device driver is initialized. |
| tt_ipmi_service_healthy | Health of the IPMI I2C polling service used for QSFP cable presence. |
| tt_is_connected | Whether connected to a collector endpoint. |
| tt_last_update_timestamp | Timestamp of last completed sampling cycle. |
| tt_reset_count | Cumulative warm reset count. |
| tt_telemetry_collection_interval_seconds | Telemetry collection interval in seconds. |
| tt_telemetry_info | Telemetry server version. |
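One way to detect that collection is not occurring is to compare tt_last_update_timestamp with tt_telemetry_collection_interval_seconds. The sketch below assumes the timestamp is in Unix epoch seconds, which this page does not state; the function name and the tolerance of three intervals are hypothetical choices.

```python
def collection_is_stale(last_update_ts, interval_seconds, now, tolerance=3.0):
    """Return True if no sampling cycle completed within tolerance intervals."""
    # ASSUMPTION: last_update_ts and now are both Unix epoch seconds.
    age = now - last_update_ts
    return age > tolerance * interval_seconds


now = 1_700_000_100  # hypothetical current time, epoch seconds
fresh = collection_is_stale(last_update_ts=1_700_000_095,
                            interval_seconds=5, now=now)
stale = collection_is_stale(last_update_ts=1_700_000_000,
                            interval_seconds=5, now=now)
print(fresh, stale)
```

Allowing a few missed intervals before flagging staleness avoids false alarms from a single slow sampling cycle.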