tt_noc_alive


Name

Prometheus Metric Name

tt_noc_alive

Metric Path (tt-telemetry)

Schema:

{hostname}/tray{tray}/chip{chip}/noc{noc}/tt_noc_alive

Example path:

bh-glx-c09u02/tray1/chip2/noc0/tt_noc_alive
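
As an illustration of the schema above, the following minimal sketch splits a metric path into its components. The names NocMetricPath and parse_noc_metric_path are hypothetical helpers written for this example; they are not part of tt-telemetry.

```python
import re
from typing import NamedTuple


class NocMetricPath(NamedTuple):
    """Components of a {hostname}/tray{tray}/chip{chip}/noc{noc}/<metric> path."""
    hostname: str
    tray: int
    chip: int
    noc: int
    metric: str


# Hypothetical pattern derived from the schema shown above.
_PATH_RE = re.compile(
    r"^(?P<hostname>[^/]+)/tray(?P<tray>\d+)/chip(?P<chip>\d+)"
    r"/noc(?P<noc>\d+)/(?P<metric>[^/]+)$"
)


def parse_noc_metric_path(path: str) -> NocMetricPath:
    """Parse a metric path, raising ValueError if it does not match the schema."""
    m = _PATH_RE.match(path)
    if m is None:
        raise ValueError(f"not a NOC metric path: {path!r}")
    return NocMetricPath(
        m["hostname"], int(m["tray"]), int(m["chip"]), int(m["noc"]), m["metric"]
    )


print(parse_noc_metric_path("bh-glx-c09u02/tray1/chip2/noc0/tt_noc_alive"))
# NocMetricPath(hostname='bh-glx-c09u02', tray=1, chip=2, noc=0, metric='tt_noc_alive')
```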

Description

Indicates whether a single NOC (Network-on-Chip) responds to a read issued from the host. On each collection cycle, the telemetry server asks UMD’s hang detector to issue a NOC read of a node-ID register that never legitimately holds 0xFFFFFFFF. If the read returns this all-ones fault signature, the NOC is considered hung; recovery usually requires a board power cycle.

The metric is emitted once per NOC on every MMIO-capable Wormhole or Blackhole chip, the architectures for which UMD provides a NOC hang detector. The noc path segment and the noc label both identify which NOC the measurement came from (0 for NOC0, 1 for NOC1).
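
The check described above can be sketched roughly as follows. This is not UMD's implementation: read_node_id is a hypothetical callable standing in for the hang detector's NOC read, and the only detail taken from the description is the comparison against the all-ones fault signature.

```python
from typing import Callable

# All-ones fault signature named in the description above.
NOC_FAULT_SIGNATURE = 0xFFFFFFFF


def noc_alive(read_node_id: Callable[[], int]) -> int:
    """Return 1 if the read looks healthy, 0 if it returns the fault signature.

    read_node_id is a hypothetical stand-in for the real NOC read of the
    node-ID register; it is assumed to return a 32-bit unsigned value.
    """
    value = read_node_id() & 0xFFFFFFFF  # keep the low 32 bits only
    return 0 if value == NOC_FAULT_SIGNATURE else 1


# Simulated reads: any value other than the signature counts as responsive.
print(noc_alive(lambda: 0x00000241))  # 1
print(noc_alive(lambda: 0xFFFFFFFF))  # 0
```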

Values

Type: Boolean

Units: None

Allowable values:

  • True (1): The NOC read returned a value other than 0xFFFFFFFF; the NOC is responsive.

  • False (0): The NOC read returned 0xFFFFFFFF; the NOC appears hung.

Prometheus Labels

Label Name    Value

hostname      The host from which the metric was collected.

hall          The datacenter hall where the host is located. Sourced from the Factory System Descriptor (FSD).

aisle         The datacenter aisle where the host is located. Sourced from the Factory System Descriptor (FSD).

rack          The rack number where the host is located. Sourced from the Factory System Descriptor (FSD).

shelf_u       The shelf U position in the rack where the host is located. Sourced from the Factory System Descriptor (FSD).

tray          The tray (UBB) that the device is located on.

chip          The ASIC location within the tray.

noc           The NOC identifier on the chip (0 for NOC0, 1 for NOC1).
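
To show how the labels above might appear on a scraped sample, here is a minimal sketch that renders one line in the Prometheus text exposition format. The helper format_sample is hypothetical, and the hall, aisle, rack, and shelf_u values are made-up placeholders; the real values come from the FSD, and the exporter's actual label order may differ.

```python
from typing import Mapping


def _escape(value: object) -> str:
    """Escape a label value per the Prometheus text exposition format."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: Mapping[str, object], value: int) -> str:
    """Render one sample line: name{label="value",...} value."""
    body = ",".join(f'{key}="{_escape(val)}"' for key, val in labels.items())
    return f"{name}{{{body}}} {value}"


labels = {
    "hostname": "bh-glx-c09u02",  # from the example path
    "hall": "hall-x",             # placeholder, not a real FSD value
    "aisle": "aisle-x",           # placeholder
    "rack": "0",                  # placeholder
    "shelf_u": "0",               # placeholder
    "tray": "1",
    "chip": "2",
    "noc": "0",
}

print(format_sample("tt_noc_alive", labels, 1))
```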