tt_reset_count

<< Home | << Metrics

Name

Prometheus Metric Name

tt_reset_count

Metric Path (tt-telemetry)

Schema:

{hostname}/telemetry/tt_reset_count

Example path:

bh-glx-c09u02/telemetry/tt_reset_count

Description

Cumulative count of warm resets detected since the telemetry server started. A warm reset is triggered externally (e.g., via tt-smi -r) and causes the driver and device metrics to be torn down and reinitialized. This counter is incremented each time a reset is detected, regardless of whether the subsequent driver reinitialization succeeds.

This metric is not created if telemetry collection is disabled.

Values

Type: Unsigned Integer

Units: None

Allowable values: Non-negative integer, monotonically increasing. Starts at 0 on server startup.

Prometheus Labels

Label Name

Value

hostname

The host from which the metric was collected.

hall

The datacenter hall where the host is located. Sourced from the Factory System Descriptor (FSD).

aisle

The datacenter aisle where the host is located. Sourced from the Factory System Descriptor (FSD).

rack

The rack number where the host is located. Sourced from the Factory System Descriptor (FSD).

shelf_u

The shelf U position in the rack where the host is located. Sourced from the Factory System Descriptor (FSD).