tt_pcie_link_alive
Name
Prometheus Metric Name
tt_pcie_link_alive
Metric Path (tt-telemetry)
Schema:
{hostname}/tray{tray}/chip{chip}/pcie/tt_pcie_link_alive
Example path:
bh-glx-c09u02/tray1/chip2/pcie/tt_pcie_link_alive
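The schema above can be expanded mechanically. The following is a minimal sketch of that substitution; the helper name `metric_path` is hypothetical and not part of tt-telemetry.

```python
def metric_path(hostname: str, tray: int, chip: int) -> str:
    """Fill in the tt_pcie_link_alive path schema for one chip.

    Hypothetical helper for illustration only; tt-telemetry builds
    its paths internally.
    """
    return f"{hostname}/tray{tray}/chip{chip}/pcie/tt_pcie_link_alive"


# Reproduces the example path shown above.
print(metric_path("bh-glx-c09u02", 1, 2))
```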
Description
Indicates whether the chip’s PCIe link responds to a host read. On each collection cycle, the telemetry
server asks UMD’s hang detector to read a BAR register that never legitimately holds 0xFFFFFFFF on a
healthy chip. If the read returns 0xFFFFFFFF, the PCIe link has silently dropped, and any
subsequent reads will also return the all-ones fault signature. Recovery usually requires a board
reset.
This metric is only created for MMIO-capable chips on Wormhole and Blackhole architectures, since those are the devices for which UMD provides a PCIe hang detector. Remote chips and other architectures are skipped (no metric is emitted).
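The decision logic described above might be sketched as follows. This is an illustration under stated assumptions, not UMD's or tt-telemetry's actual code: the function `pcie_link_alive`, its parameters, and the `read_probe_register` callback are all hypothetical stand-ins for the real hang detector.

```python
from typing import Callable, Optional

ALL_ONES = 0xFFFFFFFF  # fault signature returned by reads over a dropped PCIe link

# Architectures for which, per the description above, a hang detector exists.
SUPPORTED_ARCHS = {"wormhole", "blackhole"}


def pcie_link_alive(
    arch: str,
    is_mmio_capable: bool,
    read_probe_register: Callable[[], int],  # hypothetical 32-bit BAR read
) -> Optional[bool]:
    """Return True/False for a probed chip, or None when no metric is emitted."""
    # Remote chips and other architectures are skipped entirely.
    if not is_mmio_capable or arch.lower() not in SUPPORTED_ARCHS:
        return None
    value = read_probe_register() & 0xFFFFFFFF
    # The probed register never legitimately holds all ones, so that
    # value can only mean the link is hung.
    return value != ALL_ONES
```

In this sketch, `None` stands for "no metric emitted", which keeps a skipped chip distinct from a chip whose link is reported as hung.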
Values
Type: Boolean
Units: None
Allowable values:
True (1): The chip responded normally to the PCIe probe read.
False (0): The chip returned the 0xFFFFFFFF fault signature; the PCIe link is hung.
Prometheus Labels
| Label Name | Value |
|---|---|
| hostname | The host from which the metric was collected. |
| hall | The datacenter hall where the host is located. Sourced from the Factory System Descriptor (FSD). |
| aisle | The datacenter aisle where the host is located. Sourced from the Factory System Descriptor (FSD). |
| rack | The rack number where the host is located. Sourced from the Factory System Descriptor (FSD). |
| shelf_u | The shelf U position in the rack where the host is located. Sourced from the Factory System Descriptor (FSD). |
| tray | The tray (UBB) that the device is located on. |
| chip | The ASIC location within the tray. |