tt_ipmi_service_healthy
Name
Prometheus Metric Name
tt_ipmi_service_healthy
Metric Path (tt-telemetry)
Schema:
{hostname}/telemetry/tt_ipmi_service_healthy
Example path:
bh-glx-c09u02/telemetry/tt_ipmi_service_healthy
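The path schema above can be illustrated with a small sketch. The helper names `metric_path` and `split_metric_path` are hypothetical and are not part of tt-telemetry; they only show how the `{hostname}/telemetry/{metric}` layout composes and decomposes.

```python
METRIC_NAME = "tt_ipmi_service_healthy"


def metric_path(hostname: str, metric: str = METRIC_NAME) -> str:
    """Build a path of the form {hostname}/telemetry/{metric}."""
    if not hostname or "/" in hostname:
        raise ValueError(f"invalid hostname: {hostname!r}")
    return f"{hostname}/telemetry/{metric}"


def split_metric_path(path: str) -> tuple[str, str]:
    """Split a three-segment path back into (hostname, metric)."""
    # Unpacking raises ValueError if the path does not have exactly 3 segments.
    hostname, middle, metric = path.split("/")
    if middle != "telemetry":
        raise ValueError(f"unexpected path segment: {middle!r}")
    return hostname, metric


if __name__ == "__main__":
    print(metric_path("bh-glx-c09u02"))
```

Given the example host, `metric_path("bh-glx-c09u02")` yields the example path shown above.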
Description
A meta metric that reports the health of the IPMI I2C polling service used by
tt_ethernet_cable_present and tt_cable_present. On hosts that actually use
IPMI (i.e. Blackhole Galaxy hosts with QSFP-DD port metrics), the telemetry
server runs a background thread that scans every QSFP-DD port via ipmitool
approximately every 60 seconds and caches the per-port status in memory.
This metric is a simple binary health signal:
True:
the service was never started (hosts that don’t need IPMI — no QSFP-DD port metrics — report healthy by default),
or the service has started but its first scan hasn’t finished yet (optimistic startup state),
or the most recent scan probed every port with zero read failures (no timeouts, no BMC errors — empty sockets are detected cleanly and do not count as failures).
False: the service attempted to start but failed (e.g., ipmitool is not installed, permission denied, IPMI kernel modules missing), or the most recent scan had at least one port whose I2C read genuinely errored (timeout, BMC unresponsive, I2C bus fault, etc.).
When this metric is false on a host that uses IPMI, the cable-present
metrics may be stale for some ports (per-port entries retain their last
successfully-read value). When it is true, the cable-present data is
up-to-date as of the most recent poll cycle (or the metric is still in its
startup window, or IPMI is not being used on this host).
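The true/false rules above can be summarized as a small decision function. This is a minimal sketch, not the telemetry server's actual code: the `PollerState` class and its field names are assumptions introduced only to make the rules concrete.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PollerState:
    """Hypothetical snapshot of the IPMI poller (field names are assumed)."""
    start_attempted: bool = False
    start_failed: bool = False
    # Read failures in the most recent completed scan; None means no scan
    # has finished yet.
    last_scan_failures: Optional[int] = None


def service_healthy(state: PollerState) -> bool:
    """Apply the documented rules for tt_ipmi_service_healthy."""
    if not state.start_attempted:
        return True   # never started: healthy by default
    if state.start_failed:
        return False  # e.g. ipmitool missing or permission denied
    if state.last_scan_failures is None:
        return True   # optimistic startup state
    return state.last_scan_failures == 0


if __name__ == "__main__":
    print(service_healthy(PollerState()))
    print(service_healthy(PollerState(start_attempted=True, last_scan_failures=2)))
```

Empty sockets do not appear here: they are clean reads, so they never increment the failure count.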
Underlying mechanics
Each ipmitool invocation is wrapped with timeout 5s so a single stuck I2C read cannot wedge the polling thread indefinitely.
Disconnected ports are detected via the EEPROM rsp=0xff response (either on the success path or embedded in a write-NAK error message) and do not count as read failures; they are normal false readings of tt_ethernet_cable_present / tt_cable_present.
The metric is emitted on every host where device telemetry and IPMI are enabled. The background poller is started lazily the first time a cable-present metric queries it; this metric's own reads do not start the poller. So on hosts without QSFP-DD metrics the service is never started and this metric reports true (healthy by default).
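The timeout wrapper and the rsp=0xff rule can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: `PortRead`, `classify_read`, and `run_with_timeout` are hypothetical names, the command to run is supplied by the caller (no ipmitool arguments are assumed here), and matching on the literal substring `rsp=0xff` is a guess at how the response is recognized.

```python
import subprocess
from enum import Enum


class PortRead(Enum):
    PRESENT = "present"
    ABSENT = "absent"   # clean "no cable" reading, not a failure
    FAILED = "failed"   # genuine read failure; marks the service unhealthy


# Assumption: the marker appears literally in the tool's output.
ABSENT_MARKER = "rsp=0xff"


def classify_read(returncode: int, output: str) -> PortRead:
    """Classify one port read from its exit status and combined output."""
    # The marker wins regardless of exit status, since it may arrive on the
    # success path or inside a write-NAK error message.
    if ABSENT_MARKER in output:
        return PortRead.ABSENT
    if returncode == 0:
        return PortRead.PRESENT
    return PortRead.FAILED


def run_with_timeout(cmd: list[str], seconds: int = 5) -> PortRead:
    """Run cmd under coreutils `timeout` so a stuck read cannot hang."""
    proc = subprocess.run(
        ["timeout", f"{seconds}s", *cmd],
        capture_output=True,
        text=True,
        check=False,
    )
    # GNU timeout exits with status 124 when the limit is hit, which falls
    # through to FAILED unless the marker is present.
    return classify_read(proc.returncode, proc.stdout + proc.stderr)
```

A scan would count only `PortRead.FAILED` results as read failures; `ABSENT` ports simply report false for the cable-present metrics.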
Values
Type: Boolean
Units: None
Allowable values:
True (1): Service is healthy, hasn’t yet been started, or hasn’t yet produced a scan result.
False (0): Service attempted to start and failed, or the most recent scan had one or more legitimate read failures.
Prometheus Labels
| Label Name | Value |
|---|---|
| hostname | The host from which the metric was collected. |