# tt_ipmi_service_healthy

[<< Home](../index.md) | [<< Metrics](../metrics.md)

## Name

### Prometheus Metric Name

```
tt_ipmi_service_healthy
```

### Metric Path (tt-telemetry)

Schema:

```
{hostname}/telemetry/tt_ipmi_service_healthy
```

Example path:

```
bh-glx-c09u02/telemetry/tt_ipmi_service_healthy
```

## Description

A meta metric that reports the health of the IPMI I2C polling service used by `tt_ethernet_cable_present` and `tt_cable_present`. On hosts that actually use IPMI (i.e. Blackhole Galaxy hosts with QSFP-DD port metrics), the telemetry server runs a background thread that scans every QSFP-DD port via `ipmitool` approximately every 60 seconds and caches the per-port status in memory.

This metric is a simple binary health signal:

- **True**:
  - the service was never started (hosts that don't need IPMI, i.e. those with no QSFP-DD port metrics, report healthy by default),
  - *or* the service has started but its first scan hasn't finished yet (optimistic startup state),
  - *or* the most recent scan probed every port with zero read failures (no timeouts, no BMC errors; empty sockets are detected cleanly and do not count as failures).
- **False**: the service attempted to start but failed (e.g., `ipmitool` is not installed, permission denied, IPMI kernel modules missing), or the most recent scan had at least one port whose I2C read genuinely errored (timeout, BMC unresponsive, I2C bus fault, etc.).

When this metric is `false` on a host that uses IPMI, the cable-present metrics may be stale for some ports (per-port entries retain their last successfully-read value). When it is `true`, the cable-present data is up-to-date as of the most recent poll cycle (or the metric is still in its startup window, or IPMI is not being used on this host).

### Underlying mechanics

- Each `ipmitool` invocation is wrapped with `timeout 5s` so a single stuck I2C read cannot wedge the polling thread indefinitely.
- Disconnected ports are detected via the EEPROM `rsp=0xff` response (either on the success path or embedded in a write-NAK error message) and do **not** count as read failures; they are normal `false` readings of `tt_ethernet_cable_present` / `tt_cable_present`.
- The metric is emitted on every host where device telemetry and IPMI are enabled. The background poller is started lazily the first time a cable-present metric queries it; this metric's own reads do not start the poller. So on hosts without QSFP-DD metrics the service is never started and this metric reports `true` (healthy by default).

## Values

**Type:** Boolean

**Units:** None

**Allowable values:**

- **True (1)**: Service is healthy, hasn't yet been started, or hasn't yet produced a scan result.
- **False (0)**: Service attempted to start and failed, or the most recent scan had one or more legitimate read failures.

## Prometheus Labels

|Label Name|Value|
|---|---|
|hostname|The host from which the metric was collected.|
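
## Example: health evaluation sketch

The rules in the Description and Underlying mechanics sections can be summarized as a small decision function. The following is a minimal Python sketch, not the telemetry server's actual code: the names `PortRead`, `ServiceState`, `classify_read`, and `service_healthy` are hypothetical, and treating a zero exit code without `rsp=0xff` as "cable present" is an assumption made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class PortRead(Enum):
    """Outcome of one per-port read (hypothetical names)."""
    PRESENT = "present"  # read succeeded, cable detected
    ABSENT = "absent"    # empty socket, detected cleanly
    FAILED = "failed"    # timeout, BMC unresponsive, I2C bus fault, ...


def classify_read(exit_code: int, output: str) -> PortRead:
    """Classify one read from its exit code and combined stdout/stderr."""
    # Per the text above, `rsp=0xff` marks a disconnected port whether it
    # appears on the success path or inside a write-NAK error message.
    if "rsp=0xff" in output.lower():
        return PortRead.ABSENT
    # Assumption: any other zero-exit read means a cable is present.
    if exit_code == 0:
        return PortRead.PRESENT
    # Everything else (including a wrapper timeout) is a real read failure.
    return PortRead.FAILED


@dataclass
class ServiceState:
    """Hypothetical snapshot of the poller's state."""
    started: bool = False        # poller is started lazily
    start_failed: bool = False   # e.g. ipmitool missing, permission denied
    last_scan: Optional[List[PortRead]] = None  # None until first scan ends


def service_healthy(state: ServiceState) -> bool:
    """Return the boolean this metric would report for the given state."""
    if state.start_failed:
        return False             # start was attempted and failed
    if not state.started:
        return True              # never started: healthy by default
    if state.last_scan is None:
        return True              # optimistic startup state
    # Healthy only if no port had a genuine read failure.
    return all(r is not PortRead.FAILED for r in state.last_scan)
```

Under these assumptions, `service_healthy(ServiceState())` is `True` (service never started), and a scan containing only `PRESENT` and `ABSENT` results is also healthy, while a single `FAILED` entry makes the metric `False`.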