tt_thermal_trip_count

Name

Prometheus Metric Name

tt_thermal_trip_count

Metric Path (tt-telemetry)

Schema:

{hostname}/tray{tray}/chip{chip}/tt_thermal_trip_count

Example path:

bh-glx-c09u02/tray1/chip2/tt_thermal_trip_count
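
As an illustration of the schema above, here is a minimal Python sketch that splits a metric path into its components. The names `MetricPath` and `parse_metric_path` are hypothetical and not part of tt-telemetry. The sketch also assumes that `tray` and `chip` are always decimal integers, as in the example path.

```python
import re
from typing import NamedTuple


class MetricPath(NamedTuple):
    """Components of a chip-level metric path (hypothetical helper type)."""
    hostname: str
    tray: int
    chip: int
    metric: str


# Mirrors the schema {hostname}/tray{tray}/chip{chip}/{metric name}.
# Assumption: tray and chip are decimal integers, as in the example path.
_PATH_RE = re.compile(
    r"^(?P<hostname>[^/]+)/tray(?P<tray>\d+)/chip(?P<chip>\d+)/(?P<metric>[^/]+)$"
)


def parse_metric_path(path: str) -> MetricPath:
    """Split a metric path into hostname, tray, chip, and metric name."""
    match = _PATH_RE.match(path)
    if match is None:
        raise ValueError(f"not a chip-level metric path: {path!r}")
    return MetricPath(
        hostname=match["hostname"],
        tray=int(match["tray"]),
        chip=int(match["chip"]),
        metric=match["metric"],
    )


parsed = parse_metric_path("bh-glx-c09u02/tray1/chip2/tt_thermal_trip_count")
print(parsed)
```

Paths that do not match the schema raise `ValueError` rather than returning partial results, so malformed input is caught early.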

Description

The cumulative count of thermal trip events since the last reset (a system reboot or a tt-smi reset). The chip temperature is monitored, and a thermal trip is triggered when it exceeds a safe threshold. A non-zero value indicates that the device has experienced thermal throttling events since the last reset. This metric is only created if the device reports thermal trip data.

Values

Type: Unsigned Integer

Units: None

Allowable values: A non-negative integer representing the number of thermal trip events. A value of 0 means no thermal trips have occurred since the last reset.
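
Because the count returns to 0 on a reboot or tt-smi reset, comparing only the first and last sample of a time window can undercount. The sketch below shows one possible way to total the trips observed across a series of samples, treating any decrease as a reset. The function name `new_trips` is hypothetical, and the sketch assumes samples are ordered oldest to newest. Trips that occur and are then cleared by a reset between two consecutive samples cannot be recovered this way.

```python
from typing import Iterable, Optional


def new_trips(samples: Iterable[int]) -> int:
    """Total thermal trips observed across samples ordered oldest to newest.

    A drop in the value is treated as a counter reset (reboot or tt-smi
    reset), after which the count is assumed to have restarted from 0.
    """
    total = 0
    previous: Optional[int] = None
    for value in samples:
        if value < 0:
            raise ValueError("tt_thermal_trip_count is never negative")
        if previous is not None:
            if value >= previous:
                # Normal case: the counter only grew since the last sample.
                total += value - previous
            else:
                # Reset detected: everything counted since the reset is new.
                total += value
        previous = value
    return total


print(new_trips([0, 0, 2, 3]))  # no reset in the window
print(new_trips([3, 5, 1, 2]))  # reset between the second and third sample
```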

Prometheus Labels

| Label Name | Value |
| --- | --- |
| hostname | The host from which the metric was collected. |
| hall | The datacenter hall where the host is located. Sourced from the Factory System Descriptor (FSD). |
| aisle | The datacenter aisle where the host is located. Sourced from the Factory System Descriptor (FSD). |
| rack | The rack number where the host is located. Sourced from the Factory System Descriptor (FSD). |
| shelf_u | The shelf U position in the rack where the host is located. Sourced from the Factory System Descriptor (FSD). |
| tray | The tray (UBB) that the device is located on. |
| chip | The ASIC location within the tray. |
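
The labels above can be used to narrow a query to a single device or location. As a rough sketch, the Python helper below builds a PromQL instant-vector selector for this metric from a set of label values. The function name `build_selector` is hypothetical, and the label values shown (`"1"` and `"2"` for `tray` and `chip`) are assumptions based on the example path; check the values your deployment actually exports.

```python
from typing import Mapping

# Label names documented for this metric.
KNOWN_LABELS = {"hostname", "hall", "aisle", "rack", "shelf_u", "tray", "chip"}


def _escape(value: str) -> str:
    """Escape a label value for use inside a double-quoted PromQL string."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def build_selector(metric: str, labels: Mapping[str, str]) -> str:
    """Build a PromQL selector such as metric{hostname="...",tray="..."}."""
    unknown = set(labels) - KNOWN_LABELS
    if unknown:
        raise ValueError(f"unknown label(s): {sorted(unknown)}")
    matchers = ",".join(f'{name}="{_escape(value)}"' for name, value in labels.items())
    return f"{metric}{{{matchers}}}" if matchers else metric


selector = build_selector(
    "tt_thermal_trip_count",
    {"hostname": "bh-glx-c09u02", "tray": "1", "chip": "2"},  # assumed values
)
# Appending "> 0" keeps only devices that have tripped since the last reset.
print(f"{selector} > 0")
```

Rejecting unknown label names catches typos such as `shelf` for `shelf_u`, which would otherwise produce a query that silently matches nothing.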