tt_reset_count
Name
Prometheus Metric Name
tt_reset_count
Metric Path (tt-telemetry)
Schema:
{hostname}/telemetry/tt_reset_count
Example path:
bh-glx-c09u02/telemetry/tt_reset_count
Description
Cumulative count of warm resets detected since the telemetry server started. A warm reset is triggered externally (e.g., via tt-smi -r) and causes the driver and device metrics to be torn down and reinitialized. This counter is incremented each time a reset is detected, regardless of whether the subsequent driver reinitialization succeeds.
This metric is not created if telemetry collection is disabled.
Values
Type: Unsigned Integer
Units: None
Allowable values: Non-negative integer, monotonically increasing. Starts at 0 on server startup.
Prometheus Labels
Label Name |
Value |
|---|---|
hostname |
The host from which the metric was collected. |
hall |
The datacenter hall where the host is located. Sourced from the Factory System Descriptor (FSD). |
aisle |
The datacenter aisle where the host is located. Sourced from the Factory System Descriptor (FSD). |
rack |
The rack number where the host is located. Sourced from the Factory System Descriptor (FSD). |
shelf_u |
The shelf U position in the rack where the host is located. Sourced from the Factory System Descriptor (FSD). |