Architecture

                          ┌──────────────────────────────┐
                          │       Orchestrator           │
                          └──────────────┬───────────────┘
                                         │ gRPC
                          ┌──────────────▼─────────────────┐
                          │         Controller             │
                          │  ┌──────────────────────────┐  │
                          │  │  Orchestrator Service    │  │
                          │  │  - QueryPhysicalTopology │  │
                          │  │  - GetValidPlacementsMGD │  │
                          │  └──────────────────────────┘  │
                          │  ┌────────────────────────┐    │
                          │  │  Topology Mapper (CSP) │    │
                          │  └────────────────────────┘    │
                          │  ┌─────────────────────────┐   │
                          │  │  Daemon Service         │   │
                          │  │  - RegisterDaemon       │   │
                          │  │  - HeartbeatStream      │   │
                          │  └────────────┬────────────┘   │
                          └───────────────┼────────────────┘
                                          │ gRPC
                    ┌─────────────────────┼─────────────────────┐
                    │                     │                     │
         ┌──────────▼─────────┐  ┌────────▼───────────┐ ┌───────▼───────────┐
         │   Agent (Host 1)   │  │  Agent (Host 2)    │ │  Agent (Host N)   │
         │                    │  │                    │ │                   │
         │  Device Discovery  │  │  Device Discovery  │ │  Device Discovery │
         │  (UMD)             │  │  (UMD)             │ │  (UMD)            │
         └────────────────────┘  └────────────────────┘ └───────────────────┘

Components

  • Agent (tt-fabric-manager-agent): Runs on each host. Uses UMD to discover local Tenstorrent ASICs (unique IDs, board type, arch, memory, PCI address), intra-host Ethernet connections, and cross-host exit nodes. Registers this topology with the controller and maintains a bidirectional heartbeat stream.

  • Controller (tt-fabric-manager-controller): Centralized coordinator. Aggregates physical topology from all registered agents, tracks host health via heartbeats, and exposes an orchestrator-facing gRPC API. Uses tt-metalium’s constraint satisfaction problem (CSP) solver to map logical mesh descriptors onto physical ASICs.
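
To make the shape of the per-host topology concrete, here is a minimal Python sketch of the data an agent might report. All class and field names (Asic, EthLink, ExitNode, validate, and so on) are assumptions made for illustration; only HostPhysicalTopology appears in this document, and the real message is defined by the project's gRPC schema, not by this code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Asic:
    # Hypothetical field names mirroring the attributes listed above.
    unique_id: int
    board_type: str
    arch: str
    memory_bytes: int
    pci_address: str


@dataclass(frozen=True)
class EthLink:
    # An intra-host Ethernet connection between two local ASICs.
    src_asic: int
    dst_asic: int


@dataclass(frozen=True)
class ExitNode:
    # A local ASIC with a link that leaves this host.
    asic: int
    channel: int


@dataclass
class HostPhysicalTopology:
    hostname: str
    asics: list[Asic] = field(default_factory=list)
    intra_host_links: list[EthLink] = field(default_factory=list)
    exit_nodes: list[ExitNode] = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if IDs are duplicated or a link/exit node
        references an ASIC that was not discovered on this host."""
        ids = {a.unique_id for a in self.asics}
        if len(ids) != len(self.asics):
            raise ValueError("duplicate ASIC unique_id")
        for link in self.intra_host_links:
            if link.src_asic not in ids or link.dst_asic not in ids:
                raise ValueError(f"link references unknown ASIC: {link}")
        for node in self.exit_nodes:
            if node.asic not in ids:
                raise ValueError(f"exit node references unknown ASIC: {node}")
```

Validating on the agent side before registration is one possible design choice; it keeps malformed topologies from ever reaching the controller's aggregate view.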

Data Flow — Startup & Registration

  Agent (Host N)                          Controller
  ─────────────                          ──────────
       │                                      │
       │  1. Discover local ASICs (UMD)       │
       │◄─────────────────────┐               │
       │                      │               │
       │  2. RegisterDaemon                   │
       │     (HostPhysicalTopology)           │
       │─────────────────────────────────────►│
       │                                      │  3. Store topology
       │              RegisterResponse        │     in memory
       │◄─────────────────────────────────────│
       │                                      │
       │  4. HeartbeatStream (bidirectional)  │
       │◄────────────────────────────────────►│
       │     - Periodic keepalive             │
       │     - Health monitoring              │
       │                                      │
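
The controller side of steps 3 and 4 can be sketched as an in-memory registry that stores each host's topology and records when it last sent a heartbeat. This is an illustrative sketch only: the HostRegistry class, its method names, and the 10-second timeout are assumptions, and the actual service runs over gRPC streams rather than direct method calls.

```python
import time
from typing import Callable


class HostRegistry:
    """In-memory store of registered hosts and their last heartbeat time."""

    def __init__(self, timeout_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s      # hypothetical health timeout
        self._clock = clock              # injectable so tests can fake time
        self._topologies: dict[str, dict] = {}
        self._last_seen: dict[str, float] = {}

    def register(self, hostname: str, topology: dict) -> None:
        # Step 3: keep the topology in memory; re-registration overwrites.
        self._topologies[hostname] = topology
        self._last_seen[hostname] = self._clock()

    def heartbeat(self, hostname: str) -> bool:
        # Step 4: refresh the last-seen time. Returns False for an unknown
        # host so the caller can ask the agent to register again.
        if hostname not in self._topologies:
            return False
        self._last_seen[hostname] = self._clock()
        return True

    def healthy_hosts(self) -> list[str]:
        # A host is healthy if its last heartbeat is within the timeout.
        now = self._clock()
        return sorted(h for h, seen in self._last_seen.items()
                      if now - seen <= self._timeout_s)


# Usage with a fake clock, so the example does not depend on wall time.
now = [0.0]
registry = HostRegistry(timeout_s=5.0, clock=lambda: now[0])
registry.register("host-1", {"asics": [1, 2]})
registry.register("host-2", {"asics": [3]})
now[0] = 4.0
registry.heartbeat("host-1")
now[0] = 8.0
print(registry.healthy_hosts())
```

Because the state lives only in memory in this sketch, a controller restart would lose it and agents would need to register again.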

Data Flow — Mesh Placement Query

  Orchestrator              Controller                  Agent(s)
  ────────────              ──────────                  ────────
       │                         │                          │
       │  GetValidPlacementsMGD  │                          │
       │  (MGD textproto)        │                          │
       │────────────────────────►│                          │
       │                         │                          │
       │                         │  Aggregate topology      │
       │                         │  from registered agents  │
       │                         │                          │
       │                         │  Run CSP mapper          │
       │                         │  (map_mesh_to_physical)  │
       │                         │                          │
       │  PlacementResponse      │                          │
       │  (host→ASIC assignments)│                          │
       │◄────────────────────────│                          │
       │                         │                          │
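
The query path above (aggregate, then solve, then return host→ASIC assignments) can be illustrated with a toy Python sketch. The backtracking search here is a simple stand-in for a constraint solver and is not tt-metalium's CSP mapper; the function names (aggregate, toy_place, group_by_host) and the dictionary layout of the per-host topology are assumptions made for the example.

```python
from typing import Optional

Edge = tuple[int, int]


def aggregate(hosts: dict[str, dict]) -> tuple[dict[int, set[int]], dict[int, str]]:
    """Merge per-host reports into one adjacency map plus an ASIC->host map."""
    adjacency: dict[int, set[int]] = {}
    owner: dict[int, str] = {}
    for hostname, topo in hosts.items():
        for asic in topo["asics"]:
            owner[asic] = hostname
            adjacency.setdefault(asic, set())
    for topo in hosts.values():
        for a, b in topo["links"]:
            # Skip links whose far end belongs to an unregistered host.
            if a in owner and b in owner:
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency, owner


def toy_place(num_logical: int, logical_edges: list[Edge],
              adjacency: dict[int, set[int]]) -> Optional[dict[int, int]]:
    """Assign each logical node a distinct ASIC so that every logical edge
    lands on a physical link. Returns None if no assignment exists."""
    logical_adj: dict[int, set[int]] = {n: set() for n in range(num_logical)}
    for a, b in logical_edges:
        logical_adj[a].add(b)
        logical_adj[b].add(a)
    assignment: dict[int, int] = {}
    used: set[int] = set()

    def backtrack(node: int) -> bool:
        if node == num_logical:
            return True
        for asic in sorted(adjacency):
            if asic in used:
                continue
            # Every already-placed logical neighbour must be physically linked.
            if all(assignment[n] in adjacency[asic]
                   for n in logical_adj[node] if n in assignment):
                assignment[node] = asic
                used.add(asic)
                if backtrack(node + 1):
                    return True
                del assignment[node]
                used.discard(asic)
        return False

    return dict(assignment) if backtrack(0) else None


def group_by_host(placement: dict[int, int], owner: dict[int, str]) -> dict[str, list[int]]:
    """Shape the result as host -> list of assigned ASICs."""
    result: dict[str, list[int]] = {}
    for asic in placement.values():
        result.setdefault(owner[asic], []).append(asic)
    return result


# Two hosts forming a 4-ASIC ring; the request is a 2x2 logical mesh.
hosts = {
    "host-a": {"asics": [0, 1], "links": [(0, 1), (1, 2)]},
    "host-b": {"asics": [2, 3], "links": [(2, 3), (3, 0)]},
}
adjacency, owner = aggregate(hosts)
placement = toy_place(4, [(0, 1), (0, 2), (1, 3), (2, 3)], adjacency)
if placement is not None:
    print(group_by_host(placement, owner))
```

Plain backtracking is exponential in the worst case, so it is only suitable for illustrating the constraint being solved, not for realistic cluster sizes.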