# tt_ipmi_service_healthy

[<< Home](../index.md) | [<< Metrics](../metrics.md)

## Name

### Prometheus Metric Name

```
tt_ipmi_service_healthy
```

### Metric Path (tt-telemetry)

Schema:

```
{hostname}/telemetry/tt_ipmi_service_healthy
```

Example path:

```
bh-glx-c09u02/telemetry/tt_ipmi_service_healthy
```
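
As a small illustration of the schema above, the path can be assembled from a hostname. The helper name `metric_path` is hypothetical and is not part of tt-telemetry; this is only a sketch of the string format.

```python
METRIC_NAME = "tt_ipmi_service_healthy"


def metric_path(hostname: str) -> str:
    """Build the tt-telemetry path for this metric (hypothetical helper)."""
    return f"{hostname}/telemetry/{METRIC_NAME}"


print(metric_path("bh-glx-c09u02"))
```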

## Description

A meta-metric that reports the health of the IPMI I2C polling service used by
`tt_ethernet_cable_present` and `tt_cable_present`. On hosts that use IPMI
(i.e., Blackhole Galaxy hosts with QSFP-DD port metrics), the telemetry server
runs a background thread that scans every QSFP-DD port via `ipmitool`
approximately every 60 seconds and caches the per-port status in memory.

This metric is a simple binary health signal:

- **True**:
  - the service was never started (hosts that don't need IPMI — no QSFP-DD
    port metrics — report healthy by default),
  - *or* the service has started but its first scan hasn't finished yet
    (optimistic startup state),
  - *or* the most recent scan probed every port with zero read failures (no
    timeouts, no BMC errors — empty sockets are detected cleanly and do not
    count as failures).
- **False**: the service attempted to start but failed (e.g., `ipmitool` is
  not installed, permission denied, IPMI kernel modules missing), or the most
  recent scan had at least one port whose I2C read genuinely errored (timeout,
  BMC unresponsive, I2C bus fault, etc.).

When this metric is `false` on a host that uses IPMI, the cable-present
metrics may be stale for some ports, because per-port entries retain their
last successfully read value. When it is `true`, one of three things holds:
the cable-present data is current as of the most recent poll cycle, the
service is still in its startup window, or IPMI is not in use on this host.
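
The True/False conditions described above can be summarised as a small decision function. This is a minimal sketch rather than the telemetry server's actual code: the `PollerState` type and its field names are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PollerState:
    """Hypothetical snapshot of the IPMI polling service (names are assumptions)."""
    start_attempted: bool = False
    start_failed: bool = False
    # Read failures in the most recent scan; None until the first scan finishes.
    last_scan_read_failures: Optional[int] = None


def service_healthy(state: PollerState) -> bool:
    """Mirror the documented rules for tt_ipmi_service_healthy."""
    if not state.start_attempted:
        return True   # never started: healthy by default
    if state.start_failed:
        return False  # e.g. ipmitool missing, permission denied
    if state.last_scan_read_failures is None:
        return True   # optimistic startup state, first scan still pending
    return state.last_scan_read_failures == 0
```

Note that empty sockets never contribute to `last_scan_read_failures` in this model, matching the rule that a cleanly detected empty socket is not a failure.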

### Underlying mechanics

- Each `ipmitool` invocation is wrapped with `timeout 5s` so a single stuck
  I2C read cannot wedge the polling thread indefinitely.
- Disconnected ports are detected via the EEPROM `rsp=0xff` response (either
  on the success path or embedded in a write-NAK error message) and do **not**
  count as read failures — they are normal `false` readings of
  `tt_ethernet_cable_present` / `tt_cable_present`.
- The metric is emitted on every host where device telemetry and IPMI are
  enabled. The background poller is started lazily the first time a
  cable-present metric queries it; reads of this metric do not start the
  poller. As a result, on hosts without QSFP-DD metrics the service is never
  started and this metric reports `true` (healthy by default).
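
The timeout wrapping and the empty-socket rule above can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: the function names, the `PortRead` enum, and the plain substring match on `rsp=0xff` are all hypothetical, and the `ipmitool` arguments are deliberately left to the caller because this page does not specify them.

```python
import subprocess
from enum import Enum
from typing import Iterable, List, Sequence


class PortRead(Enum):
    PRESENT = "present"  # read succeeded, module responded
    ABSENT = "absent"    # empty socket, a normal `false` cable-present reading
    FAILED = "failed"    # genuine read failure (timeout, BMC error, bus fault)


EMPTY_SOCKET_MARKER = "rsp=0xff"  # assumption: matched as a plain substring


def with_timeout(argv: Sequence[str], seconds: int = 5) -> List[str]:
    """Prefix a command with coreutils `timeout` so it cannot hang forever."""
    return ["timeout", f"{seconds}s", *argv]


def classify(returncode: int, output: str) -> PortRead:
    """Classify one port read from its exit status and combined output."""
    # The marker may appear on the success path or inside an error message.
    if EMPTY_SOCKET_MARKER in output.lower():
        return PortRead.ABSENT
    if returncode == 0:
        return PortRead.PRESENT
    # Includes exit status 124, which GNU `timeout` returns on expiry.
    return PortRead.FAILED


def read_port(argv: Sequence[str]) -> PortRead:
    """Run one caller-supplied read command under the 5 s timeout."""
    proc = subprocess.run(with_timeout(argv), capture_output=True, text=True)
    return classify(proc.returncode, proc.stdout + proc.stderr)


def scan_healthy(results: Iterable[PortRead]) -> bool:
    """A scan is healthy when no port read genuinely failed."""
    return all(r is not PortRead.FAILED for r in results)
```

Keeping `classify` separate from `read_port` lets the classification rules be exercised without a BMC or `ipmitool` present.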

## Values

**Type:** Boolean

**Units:** None

**Allowable values:**
- **True (1)**: Service is healthy, hasn't yet been started, or hasn't yet
  produced a scan result.
- **False (0)**: Service attempted to start and failed, or the most recent
  scan had one or more genuine read failures (empty sockets do not count).

## Prometheus Labels

|Label Name|Value|
|---|---|
|hostname|The host from which the metric was collected.|
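
As a sketch of how the `hostname` label might be used, the snippet below scans Prometheus text-format samples and lists hosts reporting `0`. The sample text and the second hostname are made up for illustration, and the function name `unhealthy_hosts` is hypothetical; a real scrape may also carry labels added by the Prometheus server.

```python
import re
from typing import List

SAMPLE_RE = re.compile(
    r'^tt_ipmi_service_healthy\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
HOSTNAME_RE = re.compile(r'hostname="(?P<hostname>[^"]*)"')


def unhealthy_hosts(exposition_text: str) -> List[str]:
    """Return hostnames whose tt_ipmi_service_healthy sample is 0."""
    hosts = []
    for line in exposition_text.splitlines():
        sample = SAMPLE_RE.match(line.strip())
        if sample is None:
            continue  # comments, blank lines, or other metrics
        host = HOSTNAME_RE.search(sample.group("labels"))
        if host is not None and float(sample.group("value")) == 0.0:
            hosts.append(host.group("hostname"))
    return hosts


# Hypothetical scrape output; "example-host-b" is an invented hostname.
SAMPLE = """\
# illustrative samples only
tt_ipmi_service_healthy{hostname="bh-glx-c09u02"} 1
tt_ipmi_service_healthy{hostname="example-host-b"} 0
"""

print(unhealthy_hosts(SAMPLE))
```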