Multi-Node Scheduling (JobSet and PMIx)
Two components support multi-node jobs on Tenstorrent hardware:
JobSet (kubernetes-sigs/jobset) groups related Jobs into a single managed unit, so a multi-node run is created, scheduled, and cleaned up as one object.
kubepmix is a mutating admission webhook that injects PMIx environment variables into participating Jobs and pods, wiring up the process management interface that multi-node ranks use to find each other.
Prerequisite
The kubepmix webhook serves TLS with a certificate issued by cert-manager, so
cert-manager must be installed before tt-operator. See Prerequisites.
The webhook resources are created in the kube-pmix namespace by default. Set
kubepmix.namespace to use a different namespace.
Opting a Job in
The webhook mutates only resources that carry the opt-in label, so unrelated workloads are unaffected. Add the label to the pods that should receive PMIx wiring:
metadata:
  labels:
    kubepmix.dev/enabled: "true"
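To show where the label sits in a complete manifest, here is a minimal sketch that writes a JobSet whose pod template carries the opt-in label. It assumes the jobset.x-k8s.io/v1alpha2 API version and that labeling the pod template is sufficient; whether the Job-level metadata also needs the label is not stated here. The file name, JobSet name, replicated Job name, container name, image, and replica counts are hypothetical placeholders.

```shell
# Write a minimal JobSet manifest to a file.
# All names and the image below are placeholders, not values from tt-operator.
cat > pmix-demo-jobset.yaml <<'EOF'
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: pmix-demo                 # hypothetical name
spec:
  replicatedJobs:
    - name: workers               # hypothetical name
      replicas: 1
      template:                   # Job template
        spec:
          parallelism: 2
          completions: 2
          template:               # pod template
            metadata:
              labels:
                kubepmix.dev/enabled: "true"   # opt-in label from this page
            spec:
              restartPolicy: Never
              containers:
                - name: rank
                  image: example.com/my-app:latest   # placeholder image
                  command: ["env"]                   # print the pod environment
EOF
```

Submitting the file with kubectl apply -f pmix-demo-jobset.yaml and reading the pod logs is one way to see which environment variables the webhook added.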
Verify
kubectl get crd jobsets.jobset.x-k8s.io
kubectl -n kube-pmix get deploy kube-pmix
kubectl get mutatingwebhookconfiguration kube-pmix
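The three checks above can also be wrapped in a small helper that stops at the first missing piece. This is a sketch only: the function name check_multinode and the 120-second timeout are arbitrary choices, and it swaps the plain get on the Deployment for kubectl rollout status so that it waits for the Deployment to become ready.

```shell
# Run the checks from this page and report the first one that fails.
# The namespace defaults to kube-pmix; pass another one as the first argument.
check_multinode() {
  ns="${1:-kube-pmix}"

  # 1. The JobSet CRD must be registered.
  kubectl get crd jobsets.jobset.x-k8s.io >/dev/null \
    || { echo "JobSet CRD not found" >&2; return 1; }

  # 2. The webhook Deployment must exist and finish rolling out.
  kubectl -n "$ns" rollout status deploy/kube-pmix --timeout=120s >/dev/null \
    || { echo "kube-pmix Deployment not ready in namespace $ns" >&2; return 1; }

  # 3. The mutating webhook configuration must be present.
  kubectl get mutatingwebhookconfiguration kube-pmix >/dev/null \
    || { echo "kube-pmix webhook configuration not found" >&2; return 1; }

  echo "multi-node components look installed"
}
```

Calling check_multinode with no arguments uses the default namespace. Pass the value of kubepmix.namespace if it was changed at install time.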
Note
The components and webhook wiring are available now. End-to-end support for
running a real co-scheduled multi-node job is still maturing. To install without
these components, pass --set jobset.enabled=false and --set kubepmix.enabled=false.