tt_noc_alive
Name
Prometheus Metric Name
tt_noc_alive
Metric Path (tt-telemetry)
Schema:
{hostname}/tray{tray}/chip{chip}/noc{noc}/tt_noc_alive
Example path:
bh-glx-c09u02/tray1/chip2/noc0/tt_noc_alive
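As an illustration of the schema above, the following minimal sketch splits a metric path into its components. The names `parse_noc_alive_path` and `NocAlivePath` are hypothetical and not part of tt-telemetry, and the sketch assumes that the tray, chip, and noc segments are always non-negative decimal integers.

```python
import re
from typing import NamedTuple

# Mirrors the schema {hostname}/tray{tray}/chip{chip}/noc{noc}/tt_noc_alive.
# Assumption: tray, chip, and noc are non-negative decimal integers.
_PATH_RE = re.compile(
    r"^(?P<hostname>[^/]+)"
    r"/tray(?P<tray>\d+)"
    r"/chip(?P<chip>\d+)"
    r"/noc(?P<noc>\d+)"
    r"/tt_noc_alive$"
)


class NocAlivePath(NamedTuple):
    """Components of a tt_noc_alive metric path (hypothetical helper type)."""
    hostname: str
    tray: int
    chip: int
    noc: int


def parse_noc_alive_path(path: str) -> NocAlivePath:
    """Split a metric path into hostname, tray, chip, and noc."""
    match = _PATH_RE.match(path)
    if match is None:
        raise ValueError(f"not a tt_noc_alive path: {path!r}")
    return NocAlivePath(
        hostname=match["hostname"],
        tray=int(match["tray"]),
        chip=int(match["chip"]),
        noc=int(match["noc"]),
    )
```

Applied to the example path, this yields hostname `bh-glx-c09u02`, tray 1, chip 2, and noc 0.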
Description
Indicates whether a single NOC (Network-on-Chip) responds to a read issued from the host. On each
collection cycle, the telemetry server asks UMD’s hang detector to issue a NOC read of a node-ID
register that is guaranteed never to legitimately hold 0xFFFFFFFF. If the read returns this
all-ones fault signature, that NOC is considered hung; recovery usually requires a board power cycle.
The metric is emitted once per NOC on every MMIO-capable chip with a Wormhole or Blackhole
architecture (the devices for which UMD provides a NOC hang detector). The noc path
segment and the noc label identify which NOC the measurement came from (0 for NOC0, 1 for NOC1).
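The decision rule described above can be sketched as follows. This is only an illustration of the logic, not the telemetry server's or UMD's actual code: `read_node_id` is a hypothetical callable standing in for the hang detector's NOC read, and `collect_noc_alive` is an invented helper name.

```python
from typing import Callable, Dict, Iterable

# All-ones value that the description treats as the fault signature.
NOC_FAULT_SIGNATURE = 0xFFFFFFFF


def noc_is_alive(read_value: int) -> bool:
    """Return True unless the read came back as the all-ones fault signature."""
    return read_value != NOC_FAULT_SIGNATURE


def collect_noc_alive(
    read_node_id: Callable[[int], int],
    nocs: Iterable[int] = (0, 1),
) -> Dict[int, bool]:
    """Evaluate liveness for each NOC on one chip.

    read_node_id is a hypothetical stand-in for the hang detector's read of
    the node-ID register; it takes a NOC index and returns the 32-bit value.
    """
    return {noc: noc_is_alive(read_node_id(noc)) for noc in nocs}
```

For instance, if the stand-in read returns an ordinary value for NOC0 and 0xFFFFFFFF for NOC1, the result marks NOC0 as alive and NOC1 as hung.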
Values
Type: Boolean
Units: None
Allowable values:
True (1): The NOC read returned a sane value; the NOC is responsive.
False (0): The NOC read returned 0xFFFFFFFF; the NOC appears hung.
Prometheus Labels
| Label Name | Value |
|---|---|
| hostname | The host from which the metric was collected. |
| hall | The datacenter hall where the host is located. Sourced from the Factory System Descriptor (FSD). |
| aisle | The datacenter aisle where the host is located. Sourced from the Factory System Descriptor (FSD). |
| rack | The rack number where the host is located. Sourced from the Factory System Descriptor (FSD). |
| shelf_u | The shelf U position in the rack where the host is located. Sourced from the Factory System Descriptor (FSD). |
| tray | The tray (UBB) that the device is located on. |
| chip | The ASIC location within the tray. |
| noc | The NOC identifier on the chip ( |