tt_ipmi_service_healthy

Name

Prometheus Metric Name

tt_ipmi_service_healthy

Metric Path (tt-telemetry)

Schema:

{hostname}/telemetry/tt_ipmi_service_healthy

Example path:

bh-glx-c09u02/telemetry/tt_ipmi_service_healthy

Description

A meta metric that reports the health of the IPMI I2C polling service used by tt_ethernet_cable_present and tt_cable_present. On hosts that actually use IPMI (i.e. Blackhole Galaxy hosts with QSFP-DD port metrics), the telemetry server runs a background thread that scans every QSFP-DD port via ipmitool approximately every 60 seconds and caches the per-port status in memory.

This metric is a simple binary health signal:

  • True:

    • the service was never started (hosts that don’t need IPMI — no QSFP-DD port metrics — report healthy by default),

    • or the service has started but its first scan hasn’t finished yet (optimistic startup state),

    • or the most recent scan probed every port with zero read failures (no timeouts, no BMC errors — empty sockets are detected cleanly and do not count as failures).

  • False:

    • the service attempted to start but failed (e.g., ipmitool is not installed, permission denied, IPMI kernel modules missing),

    • or the most recent scan had at least one port whose I2C read genuinely errored (timeout, BMC unresponsive, I2C bus fault, etc.).

When this metric is false on a host that uses IPMI, the cable-present metrics may be stale for some ports, because each per-port entry retains its last successfully read value. When it is true, one of three things holds: the cable-present data is current as of the most recent poll cycle, the service is still in its startup window, or IPMI is not in use on this host.
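The true/false rules above can be summarized in a short Python sketch. This is not the telemetry server's actual code: `PollerState`, its fields, and `service_healthy` are hypothetical names chosen for illustration, and only the decision rules themselves come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PollerState:
    """Hypothetical snapshot of the IPMI poller (names are illustrative)."""
    start_attempted: bool = False             # has anything tried to start the service?
    start_failed: bool = False                # e.g. ipmitool missing, permission denied
    last_scan_failures: Optional[int] = None  # None until the first scan completes


def service_healthy(state: PollerState) -> bool:
    """Apply the documented rules for tt_ipmi_service_healthy."""
    if not state.start_attempted:
        return True   # never started: healthy by default
    if state.start_failed:
        return False  # attempted to start but failed
    if state.last_scan_failures is None:
        return True   # optimistic startup state, first scan still pending
    return state.last_scan_failures == 0  # any genuine read failure means unhealthy
```

For example, `service_healthy(PollerState())` returns `True` for a host that never starts the poller, while a started poller whose last scan recorded two failed reads yields `False`.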

Underlying mechanics

  • Each ipmitool invocation is wrapped with timeout 5s so a single stuck I2C read cannot wedge the polling thread indefinitely.

  • Disconnected ports are detected via the EEPROM rsp=0xff response (either on the success path or embedded in a write-NAK error message) and do not count as read failures; they simply yield a false value for tt_ethernet_cable_present / tt_cable_present.

  • The metric is emitted on every host where device telemetry and IPMI are enabled. The background poller is started lazily the first time a cable-present metric queries it; this metric’s own reads do not start the poller. As a result, on hosts without QSFP-DD metrics the service is never started and this metric reports true (healthy by default).
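The timeout and empty-socket handling described above might look roughly like the following Python sketch. It rests on several assumptions: the caller supplies the probe command (no ipmitool arguments are implied here), Python's `subprocess` timeout stands in for the `timeout 5s` wrapper, the marker is matched as the literal text `rsp=0xff` in the combined output, and any other successful read is treated as "cable present". The names `PortRead`, `classify_read`, and `probe_port` are hypothetical.

```python
import subprocess
from enum import Enum


class PortRead(Enum):
    PRESENT = "present"  # read succeeded (assumed to mean a module answered)
    ABSENT = "absent"    # empty socket, detected cleanly; not a failure
    FAILED = "failed"    # genuine read error; makes the scan unhealthy


EMPTY_SOCKET_MARKER = "rsp=0xff"  # marker text taken from the description


def classify_read(returncode: int, output: str) -> PortRead:
    """Classify one probe from its exit code and combined stdout/stderr."""
    # The marker may appear on the success path or inside an error message,
    # so it is checked before the exit code.
    if EMPTY_SOCKET_MARKER in output.lower():
        return PortRead.ABSENT
    if returncode == 0:
        return PortRead.PRESENT
    return PortRead.FAILED


def probe_port(command: list, timeout_s: float = 5.0) -> PortRead:
    """Run one probe command, bounded so a stuck read cannot block the poller."""
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return PortRead.FAILED  # stuck I2C read
    except OSError:
        return PortRead.FAILED  # e.g. binary missing or not executable
    return classify_read(proc.returncode, proc.stdout + proc.stderr)
```

Under this sketch, a scan would count only `PortRead.FAILED` results as read failures; `ABSENT` results leave the health signal untouched.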

Values

Type: Boolean

Units: None

Allowable values:

  • True (1): Service is healthy, hasn’t yet been started, or hasn’t yet produced a scan result.

  • False (0): Service attempted to start and failed, or the most recent scan had one or more genuine read failures (empty sockets are not counted).

Prometheus Labels

Label Name | Value

hostname | The host from which the metric was collected.
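As a consumer-side illustration, the sketch below scans Prometheus text-format output for hosts reporting 0. It assumes samples appear in the standard exposition form `tt_ipmi_service_healthy{hostname="..."} <value>`; the function name `unhealthy_hosts` is hypothetical, and the exact output of the telemetry server is not confirmed by this page.

```python
import re

# Matches one sample line for this metric and captures its label set and value.
SAMPLE_LINE = re.compile(
    r'^tt_ipmi_service_healthy\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
HOSTNAME_LABEL = re.compile(r'hostname="([^"]*)"')


def unhealthy_hosts(exposition_text: str) -> list:
    """Return hostnames whose tt_ipmi_service_healthy sample is 0."""
    hosts = []
    for line in exposition_text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match is None:
            continue  # comment lines, other metrics, blank lines
        host = HOSTNAME_LABEL.search(match.group("labels"))
        if host and float(match.group("value")) == 0.0:
            hosts.append(host.group(1))
    return hosts
```

Keep in mind that a value of 1 does not prove the cable-present data is fresh: hosts that never start the poller, or that are still in the startup window, also report 1.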