A curated directory of projects, tools, models, and research for Tenstorrent hardware — contributed by the community and our team. Browse by category or search across all entries.
htop-style process monitor for GPUs and AI accelerators. Supports AMD, Apple, Huawei, Intel, NVIDIA, Qualcomm — and Tenstorrent. Real-time utilization, memory, and process info in a terminal UI.
Vendor-agnostic orchestration for training, inference, and agentic workloads across NVIDIA, AMD, TPU, and Tenstorrent on clouds, Kubernetes, and bare metal.
Open-source CUDA compiler targeting multiple GPU architectures including Tenstorrent. Compiles .cu files to run on AMD and Tenstorrent hardware without modification.
Minimal Python code to access and program the Tenstorrent Blackhole chip directly — George Hotz's exploration of TT hardware programmability with pointed commentary on the architecture.
Community-built Tenstorrent architecture simulator written in Python. Runs without hardware — useful for researchers and developers exploring the Tensix architecture offline.
IREE (Intermediate Representation Execution Environment) ML compiler ported to Tenstorrent AI accelerators. Brings the IREE compiler ecosystem to TT hardware.
OpenAI Triton compiler plugin for Tenstorrent hardware. Write Triton kernels and target Tensix cores — brings the Triton ML kernel ecosystem to TT devices.
Boot stock Linux cloud images on the SiFive X280 RISC-V cores inside Tenstorrent Blackhole AI accelerators. Per-card Rust daemon with virtio-mmio block/net/console and U-Boot/EFI support.
# Changelog

Notable changes per release. Format loosely follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/); this project does not yet promise SemVer compatibility on the RPC wire format or library API surface (we're not 1.0).

## Unreleased

V2 virtio-dispatch redesign. The kick ring + completion ring + host-side throttle that grew up around #184 are gone; in their place is a per-(slot, queue) dirty bitmap in BRISC L1. The bitmap is level-sensitive: guest QUEUE_NOTIFY storms coalesce into a single set byte, so the dispatch path can't fall behind under any burst. Wire-incompatible with 0.9.0; `TENSIX_PROTOCOL_VERSION` bumped 4 → 5.

### Added

- **V2 dirty-bitmap dispatch** (`#187` / `#188` / `#189`). BRISC writes 1 to `CTRL_OFF_DIRTY[slot][queue]` on every guest QUEUE_NOTIFY; the daemon's `Dispatcher` clears the byte and dispatches each pass. Replaces V1's 2048-entry kick ring + daemon-side `consume_kick_ring_pass` consumer.
- **V2 processed-cursor table** at `CTRL_OFF_PROCESSED`. Daemon publishes `used.idx` after each successful dispatch so warm-resume reads cursors directly without re-probing guest DRAM.
- **`bhx_notify_events_total`, `bhx_dispatch_passes_total`, `bhx_dispatch_queues_drained`** Prometheus counters surface the new dispatch path. The burst regression test (`scripts/soak_virtio_burst.py`) asserts `dispatch_passes_total > 0` to confirm the workload reached the new path.
- **`scripts/soak_virtio_burst.py`**: multi-queue burst regression test. Sustains 16-job direct=1 fio randwrite + a tight `printf` loop to `/dev/console`, samples `/metrics` every 1 s, and verifies the daemon log contains zero `kick.*drop|rescue|throttle.*ENGAGE` matches.
- **`DaemonState.chip_reset_this_session`** flag, which gates `maybe_opportunistic_reset_board` so 4-way parallel cold boots reset the chip exactly once, not once per L2CPU. Without this the second-and-later resets blip the chip while earlier-booted L2CPUs hold mmap pages, SIGBUSing their workers.
- **`Dispatcher` (was `KickPoller`)** with documented testability seam (`CtrlL1Access` trait); `drain_dirty_bitmap` is unit-tested against an in-memory L1 fake covering all five visit/clear semantics cases plus the address-formula pins.

### Changed

- **`KickPoller` → `Dispatcher`**, plus `kick_poller` → `dispatcher` field on `DaemonState`, `tensix-kick-poller` → `tensix-dispatcher` thread name, `[kick-poller]` → `[dispatcher]` log tag, `kicks_consumed` → `dispatches_total`, `last_kick_slot_queue` → `last_dispatch_slot_queue`. Pure rename; no behavior change. V1 vocabulary scrubbed throughout the codebase (firmware, daemon, scripts, docs).
- **`CTRL_SIZE` shrinks 36 KiB → 4 KiB**. V2 footprint is ~1.5 KiB; the rest is reserved for future fields.
- **Stats-page offsets repacked**: V1 `STATS_OFF_KICK_DROPS`, `STATS_OFF_COMPL_EVENTS`, `STATS_OFF_LAST_COMPL` retired with V1 (#190); deprecated PRECAP / BLINDCAP / POSTCAP slots dropp
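
The level-sensitive coalescing described in the Unreleased notes can be illustrated with a small model. This is only a sketch in Python, not the project's Rust daemon or BRISC firmware: the `DirtyBitmap` class, the slot and queue counts, and the clear-then-dispatch ordering are assumptions made for illustration.

```python
# Hypothetical sizes; the changelog does not state the real slot/queue counts.
NUM_SLOTS = 4
NUM_QUEUES = 8


class DirtyBitmap:
    """One byte per (slot, queue), modelling a level-sensitive notify flag."""

    def __init__(self):
        self.dirty = bytearray(NUM_SLOTS * NUM_QUEUES)

    def _index(self, slot, queue):
        return slot * NUM_QUEUES + queue

    def notify(self, slot, queue):
        # Producer side: writing 1 to an already-set byte changes nothing,
        # so any number of notifies collapses into one pending flag.
        self.dirty[self._index(slot, queue)] = 1

    def drain(self, dispatch):
        # Consumer side: clear the byte before dispatching (assumed order),
        # so a notify arriving mid-dispatch is picked up on the next pass.
        visited = 0
        for slot in range(NUM_SLOTS):
            for queue in range(NUM_QUEUES):
                i = self._index(slot, queue)
                if self.dirty[i]:
                    self.dirty[i] = 0
                    dispatch(slot, queue)
                    visited += 1
        return visited


bitmap = DirtyBitmap()
for _ in range(10_000):  # notify storm on a single queue
    bitmap.notify(1, 3)
bitmap.notify(2, 0)

handled = []
drained = bitmap.drain(lambda s, q: handled.append((s, q)))
```

Unlike a fixed-size ring, the amount of pending state here is bounded by the number of (slot, queue) pairs rather than by the number of notifies, which is why a burst cannot overflow it.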
Boltz-2 biomolecular model for drug discovery on Tenstorrent Blackhole. Supports single-card and multi-card configurations — QuietBox (4×) and Galaxy (32×). Approaches physics-based FEP accuracy at 1000× the speed.
Deep-dive into the Tenstorrent architecture and Metalium programming model — circular buffers, kernel synchronization, NoC routing, and where the footguns are. The honest guide to thinking in Tensix.
Sponsored series of deep technical articles on implementing optimal SFPU kernels for the Tenstorrent Wormhole and Blackhole vector units. Covers the `where` operation, typecasting, 16/32-bit integer multiplication, cube root, and accurate sin/cos/tan, with cycle counts, assembly walkthroughs, and Blackhole vs Wormhole comparisons throughout.
A 6,500-word community deep dive into the Blackhole p100a architecture: the tile model (Tensix, DRAM, SiFive X280 L2CPU, Ethernet, PCIe, NoC arc), firmware startup sequence, MOP micro-op processor, replay buffer, FPU/SFPU sync, and the anatomy of a kernel. From the author of blackhole-py.
FlashAttention-style attention kernel implemented entirely in on-chip SRAM on the Tenstorrent Grayskull chip using TT-Metalium. Pioneering work in low-level attention on TT hardware.
A Tenstorrent Grayskull kernel written live on Twitch by George Hotz. 120-core grid demonstration of live kernel programming.
Example applications and deployment configurations for running AI workloads on Tenstorrent hardware via Koyeb's cloud platform.
Pure Python driver for Tenstorrent Blackhole cards providing direct low-level hardware access without going through the full TT-Metal stack.
Simple C++ kernel experiments on a Grayskull e75 chip. Hands-on examples for learning the TT-Metal programming model at the metal level.
Minimal working example of using Tenstorrent TTNN in C++. The simplest possible starting point for C++ developers targeting TT hardware with TTNN.
Conway's Game of Life implemented on Tenstorrent hardware using TT-Metal kernels.
Mandelbrot Set fractal renderer running on Tenstorrent hardware. A classic demo showcasing parallel compute on Tensix cores.
Minimal working CMake project template for starting a new TT-Metal project from scratch. Good starting point for community kernel development.
Tutorial on Tenstorrent hardware for HPC researchers from the RISC-V Testbed project at Edinburgh/EPCC. Covers Wormhole from an HPC parallel-computing perspective.
clpeak-style peak-performance benchmark for Tenstorrent devices using TT-Metalium. Measures theoretical peak throughput across operations — useful for hardware characterization.
Nix flake packaging the Tenstorrent software stack for NixOS and Nix users. Reproducible, declarative installation of TT drivers and tools.
High-level parallel programming framework for Tenstorrent accelerators, abstracting TT-Metal into a research-oriented programming model for parallel computation.
Minimal vector-addition example on Tenstorrent devices using TT-Metalium. A clean hello-world for the TT-Metal kernel programming model in C++.
ttas is a hacker-friendly assembler/disassembler for Tensix on Wormhole. It turns assembly into the exact 32-bit words the hardware runs, and turns binaries back into readable instructions using the same shared instruction table.
Comprehensive tutorials for the Tenstorrent software stack in Korean. Jupyter notebooks covering the full developer path from hardware setup to model inference.
Master's thesis implementing and benchmarking five allreduce algorithms (Swing, Recursive Doubling, Bandwidth Optimal, Latency Optimal, Shared Memory) on the Wormhole n150. Bandwidth Optimal performed best, coming within 2× of the theoretical optimum.
Rust crate that exposes the TT-Metal host API through a C++ bridge via cxx.rs — covering device management, program/kernel creation (from source file or inline string), circular buffers, semaphores, runtime arguments, sharded buffers, and MeshDevice workflows, with hardware-backed integration tests.
Port of Gaussian Splatting (3D scene reconstruction from 2D images) to Tenstorrent hardware.
Step-by-step guide to getting a Tenstorrent card running on Arch Linux with the full Metalium stack. Practical troubleshooting from someone who did it the hard way first.
Honest field notes from getting a Grayskull card running and writing first Metalium kernels. Covers setup pitfalls, processor hangs, memory protection quirks, and what makes Metalium compelling despite early rough edges.
Lecture 20 from William & Mary's graduate Computer Architecture course. Frames Tenstorrent in the landscape between GPUs and TPUs, draws comparisons to Cerebras and SambaNova, then dives deep into the Wormhole chip and Tensix core: the 5 RISC-V core design, SFPU, NoC, and dataflow execution model.
A fused kernel for the Grayskull architecture implementing Transformer self-attention entirely within SRAM. Combines matrix multiply, attention score scaling, and Softmax without DRAM accesses, achieving significant speedups over non-fused implementations.
Ports the Cooley-Tukey FFT algorithm to the Wormhole n300 RISC-V accelerator. The Wormhole draws 8× less power and consumes 2.8× less energy than a 24-core Xeon Platinum for a 2D FFT. ISC 2025.
Evaluates the Tenstorrent Grayskull e75 RISC-V accelerator for matrix multiplication at reduced numerical precision (BFP8 and LoFi), a fundamental kernel in LLM inference computation.
Evaluates three strategies for scaling an N-body code across multiple Tenstorrent Wormhole accelerators. Builds on the established performance of single-card N-body work to explore parallelism via the on-chip NoC and multi-accelerator configurations.
Accelerates an astrophysical N-body simulation on the Wormhole n300. Achieves 2× speedup and 2× energy savings over a highly optimized CPU implementation. SC '25 Workshop.
Implements three numerical kernels and composes them into a conjugate gradient solver on Wormhole. Demonstrates that AI accelerators merit consideration for HPC workloads traditionally dominated by CPUs and GPUs. 2026.
Explores stencil computation on the Grayskull PCIe RISC-V accelerator. Early academic work examining TT hardware for HPC stencil workloads. 2024.
Maps 2D 5-point stencil computations onto the Tenstorrent Wormhole RISC-V AI dataflow accelerator via two implementations: element-wise decomposition (Axpy) and matrix-multiplication reformulation (MatMul). Profiling shows the isolated Wormhole kernel is competitive with CPU execution, with PCIe transfers and initialization driving end-to-end overhead; Axpy achieves lower energy than the CPU baseline at large scales. Identifies architectural and software directions for making AI accelerators viable for HPC stencil workloads. 2025.
Makes multi-tenant NPU sharing practical for Blackhole-class hardware using polynomial-time allocation algorithms. Delivers up to 1.37× higher utilization and 1.14× faster workload completion. Up to 890,000× faster than NP-hard baselines.
Compiler system that automatically generates efficient dataflow plans for tile-based languages on spatial accelerators including Tenstorrent Wormhole. Exploits on-chip network forwarding between processing elements to reduce DRAM pressure.
Shows that Text-to-Speech inference on Tenstorrent Lightning V2 achieves 4× lower cost than NVIDIA L40S. Applies BlockFloat8 (BFP8) and low-fidelity (LoFi) precision strategies to TTS despite their greater numerical fragility compared to LLMs.
A Tenstorrent fork of Infocom's Zork I (and more!), running a Z-machine interpreter at least four different ways on TT hardware. The most fun you can have with an AI accelerator.
Three agentic projects running fully on-device: local AI agents on QuietBox 2, a coding assistant powered by Aider against a local inference server, and the OpenClaw AI assistant on QuietBox 2. No cloud APIs — all inference runs on TT hardware.
Three lesson-projects covering on-device video synthesis: frame-by-frame diffusion with tt-local-generator, native AnimateDiff video animation, and video generation on QuietBox 2. All run entirely on TT hardware with no cloud dependency.
Hardware topology visualizer for Tenstorrent chips — from individual chip to full cluster. Interactive JavaScript visualization of Tensix core layout and NoC connections.
# Changelog
All notable changes to tensix-viz are documented here.
## [1.1.0] - 2026-06-09
### Fixed
- **Heatmap: non-tensix cells no longer painted by heat overlay** (`src/chip.js` `_drawHeatmap`)
Commit 76dca80 added `coreType !== 'tensix'` guards to the pre-built artifacts but never to
the source. The guards are now in `src/chip.js` so the next build preserves them. Without this
fix, DRAM (col 5 on Wormhole), ETH (row 6 on Wormhole), and PCIe (col 8 on Blackhole) cells
were colored by the heatmap overlay and could inflate `maxVal`, compressing the visible range
for all tensix cells.
- **Memory overlay: stale phase not rendered after `reset()` on `showMemory: true` instances**
(`src/chip.js` `reset()` and constructor)
After calling `viz.activate(mode)` followed by `viz.reset()` on a canvas created with
`showMemory: true`, `_memPhase` retained the frozen `_mem` object from the animation closure.
`reset()` calls `render()` at the end, which caused `_drawMemoryLayer()` to run with stale data,
producing a faint DRAM glow and L1 fill bars on an otherwise blank chip. `reset()` now sets
`this._memPhase = null`; the field is also explicitly initialized to `null` in the constructor.
- **Canvas context: `getContext('2d')` moved to after canvas sizing**
(`src/chip.js` constructor)
The 2D context was obtained before `canvas.width`/`canvas.height` were assigned. Assigning to
`canvas.width` resets all context state per spec, making the early `getContext` call redundant
and inconsistent with the intent. `this.ctx` is now assigned after the sizing block so the
obtained context reflects the final dimensions.
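
To see why the heatmap guard matters, here is a minimal sketch in Python (not the project's JavaScript). Only `coreType` and `maxVal` come from the notes above; the `heat` field and both function names are assumptions for illustration.

```python
def heat_max(cells):
    """Maximum heat over tensix cells only; 0.0 when there are none."""
    values = [c["heat"] for c in cells if c["coreType"] == "tensix"]
    return max(values, default=0.0)


def normalized_heat(cells):
    """Scale tensix heat to [0, 1]; non-tensix cells map to None (unpainted)."""
    top = heat_max(cells)
    out = []
    for c in cells:
        if c["coreType"] != "tensix":
            out.append(None)  # guard: skip DRAM / ETH / PCIe cells
        else:
            out.append(c["heat"] / top if top > 0 else 0.0)
    return out


cells = [
    {"coreType": "tensix", "heat": 2.0},
    {"coreType": "tensix", "heat": 4.0},
    {"coreType": "dram", "heat": 40.0},  # would set the maximum without the guard
]
scaled = normalized_heat(cells)
```

With the guard the two tensix cells scale to 0.5 and 1.0. Without it the DRAM cell would set the maximum to 40.0 and squeeze them down to 0.05 and 0.1, which is the range compression the fix describes.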
### Added
- **Responsive canvas sizing** (`src/chip.js` constructor)
If `canvas.parentElement` exists and `clientWidth` is smaller than the canvas's intrinsic
`width` attribute, logical dimensions are capped to the container width and height is scaled
proportionally. Applies at construction time; re-create the instance for later resizes.
- **Float label boundary clamping** (overridden `render()`)
The floating tooltip label is now clamped so its pill box never overflows any canvas edge.
`rawCx`/`rawCy` are constrained by `Math.max(w/2+margin, Math.min(logicalW-w/2-margin, raw*))`.
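
The two additions above reduce to a little arithmetic. The sketch below restates it in Python rather than the project's JavaScript; the function and parameter names are hypothetical, and any rounding the real code applies is not shown.

```python
def fit_to_container(width, height, container_width):
    """Cap logical width to the container and scale height proportionally."""
    if container_width is None or container_width >= width:
        return width, height  # no parent, or the canvas already fits
    scale = container_width / width
    return container_width, height * scale


def clamp_label_center(raw, label_size, canvas_size, margin):
    """Mirror of max(w/2 + margin, min(logicalW - w/2 - margin, raw)) on one axis."""
    half = label_size / 2
    low = half + margin
    high = canvas_size - half - margin
    return max(low, min(high, raw))


size = fit_to_container(800, 600, 400)
left = clamp_label_center(5, 80, 400, 4)     # too close to the left edge
right = clamp_label_center(390, 80, 400, 4)  # too close to the right edge
middle = clamp_label_center(200, 80, 400, 4) # already inside, unchanged
```

Clamping the label's center rather than its corner keeps the formula symmetric: the same call works for the horizontal and vertical axes by passing the label's width or height.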
## [1.0.0] - 2026-05-18
Initial public release.
Particle Life simulation on Tenstorrent hardware — an emergent-behavior N-body system where simple attraction/repulsion rules between species produce complex lifelike patterns. Cookbook recipe demonstrating parallel N-body compute on Tensix.
Seven-module computer science curriculum taught on real Tenstorrent hardware. Covers RISC-V architecture, memory hierarchy, parallel computing, networks and NoC, synchronization, abstraction layers, and computational complexity — all grounded in what is physically happening on the chip.
A growing collection of models that use tt-lang for some or all of their implementation. Reference implementations for bringing modern models to the tt-lang DSL.
Sync your Tenstorrent QuietBox's RGB lighting to accelerator utilization status. Visual feedback for hardware activity in real time.
Gemma 4 language model implemented in tt-lang (e4b variant) for direct execution on Tenstorrent hardware.
tt-lang inference script for Oasis 500M — an interactive video world model running on Tenstorrent hardware via the tt-lang DSL.
Discover, load, and benchmark models with a GUI and TUI for tt-inference-server. Makes exploring available models on Tenstorrent hardware as easy as browsing a catalog.
A Tenstorrent-powered claw machine that rewards players with real prizes. The QuietBox 2 runs local AI inference to act as an agent controlling the claw hardware — the OpenClaw AI assistant lesson builds directly on this project.
DFlash: Block Diffusion for Flash Speculative Decoding on Tenstorrent hardware using tt-lang. Combines block diffusion with speculative decoding for faster inference.
DIAMOND: Atari game-playing agent implemented on Tenstorrent hardware via tt-lang. Diffusion-based world model for reinforcement learning.
A Tenstorrent port of the DeepSeek Engram model using tt-lang. Brings DeepSeek's memory-efficient architecture to TT hardware.
On-device image generation with Stable Diffusion XL running entirely on Tenstorrent hardware. Full inference pipeline with no cloud dependency.
Compiles more than 100 models with tt-forge and presents them in a display format suitable for demos. A comprehensive showcase of tt-forge model compatibility.
End-to-end image classification project using TT-Forge — compile and run a PyTorch classification model on Tenstorrent hardware with no kernel authoring required.
Warp terminal plugin for Tenstorrent — integrates hardware status, model management, and developer workflows directly into the Warp terminal.
Interactive browser-based visualizer of the Tenstorrent Tensix grid architecture. Explore the NoC, core layout, and dataflow patterns without hardware — a great companion for learning kernel programming.
TT-Metalium implementation of Conway's Game of Life as a cookbook recipe. Each generation is a full parallel kernel dispatch over the grid — a clean introduction to stateful compute on Tensix cores.
Eight-lesson series covering the full custom training workflow on TT hardware: dataset fundamentals, configuration patterns, fine-tuning, multi-device distributed training, experiment tracking, model architecture basics, and training from scratch.
Three hands-on TT-Metalium kernel recipes: a Mandelbrot fractal explorer, real-time audio signal processing pipeline, and custom image filter stack. Each recipe is a complete kernel project with full source in the lesson.
Linux demo for the Tenstorrent Blackhole P100/P150 card RISC-V cores. Boot a real Linux kernel on the 16 high-performance RISC-V cores built into the Blackhole chip.
Browser-based cloud console for exploring AI on Tenstorrent hardware. Run LLM inference, image and video generation, and browse the supported model catalog in-browser — backed by Tenstorrent accelerators. Cloud hardware access and advanced workflows (deployments, agents) available in staged rollout.
TT-NN operator library and TT-Metalium low-level kernel programming model. The primary SDK for developing on Tenstorrent hardware — from high-level tensor ops to bare-metal RISC-V kernels.
TT-BUDA: Tenstorrent's original Python compiler and runtime for AI workloads. Legacy stack — tt-forge is the recommended successor, but tt-buda has the largest model demo library.
Tenstorrent's MLIR-based compiler frontend. Enables running AI workloads from PyTorch, ONNX, and other frameworks on all Tenstorrent hardware configurations through an open-source, general, and performant compiler.
Tenstorrent MLIR compiler — the core compiler infrastructure shared by tt-forge and other frontends. Handles graph optimization, lowering, and code generation for Tensix hardware.
The Berkeley Out-of-Order Machine with V-EXT (RISC-V Vector Extension) support. Tenstorrent's research-grade out-of-order RISC-V core with vector extension.
Fast full-system simulator of Tenstorrent Wormhole and Blackhole hardware. Runs TT-Metalium workloads on any Linux/x86_64 system without physical silicon. Bit-exact results relative to hardware.
RISC-V Instruction Set Simulator (ISS) used by Tenstorrent for processor verification. Powers the co-simulation architecture checker.
PJRT device plugin for Tenstorrent hardware. Enables JAX, PyTorch/XLA, and other XLA-based frameworks to target TT accelerators.
RISC-V Directed Test Framework and Compliance Suite. Comprehensive test infrastructure for verifying RISC-V processor implementations against the specification.
Tenstorrent kernel module driver. The Linux kernel module required to interface with Tenstorrent PCIe accelerator cards.
Repository of model demos using TT-Buda. The largest collection of pre-compiled model examples for Tenstorrent hardware — BERT, ResNet, YOLO, GPT-2, Whisper, and many more.
ONNX graph compiler for Tenstorrent hardware. Optimizes and transforms ONNX model graphs for efficient execution on Tensix accelerators. Used as a backend by tt-forge for ONNX model ingestion.
Tenstorrent System Management Interface — monitor device telemetry, issue board-level resets, and inspect hardware health. The nvidia-smi equivalent for Tenstorrent hardware.
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 3.0.26 - 29/07/25

- Added single tray galaxy reset option
- Bumped luwen from 0.7.5 -> 0.7.10
- Chip detect now doesn't wait for eth to train for the 6U galaxies, allowing multi tray resets to happen independently
- Updated readme with the new reset option

## 3.0.25 - 29/07/25

- Added packaging

## 3.0.24 - 04/07/25

- Now users have 2 galaxy reset modes available
  - glx_reset: resets the galaxy, informs users if there has been an eth failure
  - glx_reset_auto: resets the galaxy up to 3 times if eth failures are detected

## 3.0.23 - 03/07/25

- Bumped luwen 0.7.3 -> 0.7.5 to fix cargo lock compatibility issue

## 3.0.22 - 02/07/25

- Bumped tt-tools-common 1.4.16 -> 1.4.17
- Bumped luwen 0.7.2 -> 0.7.3
- Bumped smi 3.0.21 -> 3.0.22

## 3.0.21 - 26/06/25

- Added option to not re-init chips after reset
- Updated galaxy 6u reset option from --ubb_reset to -glx_reset
- Removed the a3 arc message before doing a 6u reset, meaning we can reset even when chips are not pcie accessible
- Added eth link check and return failure if any of the eth links have a LINK_INACTIVE_FAIL_DUMMY_PACKET failure

## 3.0.20 - 04/06/25

- Chore - bumped tt-tools-common version to fix driver version check for compatibility with tt-kmd 2.0.0

## 3.0.19 - 30/04/25

- Fixed an issue preventing the telemetry thread from being dispatched when the user clicked tab 2

## 3.0.18 - 22/05/25

- Added BH and WH UBB board type support
- Removed the dependency on tt-tools-common for this info

## 3.0.17 - 13/05/25

- Added proper telemetry heartbeat checks for Grayskull

## 3.0.16 - 12/05/25

- Used new ResetTypes from tools-common to simplify reset code
- Added a heartbeat spinner to the telemetry pane. We expect this spinner to update about twice per second. If the spinner is not moving, this indicates new telemetry is not being fetched.

## 3.0.15 - 24/04/25

- Patch for the ubb_reset to just discover local only post reset. Looks like eth port status 2 has been re-used to mean connected and pyluwen waits for it to clear, leading to eth timeout.

## 3.0.14 - 21/04/25

- Added wh ubb reset via command line `tt-smi --ubb_reset`. Intention is that this command line option will be removed and integrated into `tt-smi -r` after we update board detection with the correct external naming.
- Removed some unused imports and code - no functional changes

## 3.0.13 - 21/03/25

- Removed get\_sw\_versions

## 3.0.12 - 21/03/25

- Chore - bumped luwen version to include eth fw version check fix

## 3.0.11 - 13/03/25

- Chore - bumped luwen version to include enable chips with external connections but no routing

## 3.0.10 - 10/03/25

- Chore - bumped luwen version to include protoc lib detection check

## 3.0.9 - 07/03/25

- Chore - bumped luwen v
Production-ready model serving for Tenstorrent hardware with OpenAI-compatible REST API. Supports continuous batching, multiple models, and all TT hardware configurations.
Comprehensive tool for visualizing and analyzing model execution on Tenstorrent hardware. Interactive graphs, memory plots, tensor details, buffer overviews, operation flow graphs, and multi-instance support.
Tenstorrent Low-Level Kernels: the C++ library that directly programs the RISC-V cores inside each Tensix compute engine. TRISC0 (unpack), TRISC1 (math/FPU/SFPU), and TRISC2 (pack) are all programmed through this layer — it is the interface between TT-Metal kernel code and bare silicon.
Python-based DSL that sits between TT-NN and TT-Metalium — expresses custom fused kernels with progressive disclosure, compiling directly to Tensix. Ships an integrated functional simulator (no hardware needed), line-by-line performance metrics, and AI-agent-friendly tooling. Two packages: tt-lang (compiler + hardware, requires ttnn) and tt-lang-sim (simulator only, works on Linux/macOS without Tenstorrent hardware).
# Changelog

All notable changes to TT-Lang will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## Version 1.1.1

### Compiler

- Fix for live-interval boundary computation (issue [#536](../../issues/536))
- Fix for all-zero results in FP32 reductions (issue [#533](../../issues/533))
- Fix for inferred `pop` and `push` (issues [#536](../../issues/536), [#554](../../issues/554))
- Fix for write pointer tracking on pipe sender across iterations (issue [#578](../../issues/578))
- Fix to report data type mismatch error
- Fix to report DFB over allocation error (issue [#511](../../issues/511))
- Support for pipenet predicates `is_src`, `is_dst` and `is_active` (issue [#541](../../issues/541))
- Support for `ttl.math.typecast`

### Simulator

- Support for inferred `pop`, `push` and `copy`'s transfer handle `wait`
- Support for pipenet predicates `is_src`, `is_dst` and `is_active`
- Support `all_gather`
- Support `bfloat8_b`
- Improved/actionable error messages
- Improved performance by simulating math in FP32

### Infrastructure

- TT-Lang installable with `pip install tt-lang` for full installation and `pip install tt-lang-sim` for simulator only
- [Matmul benchmarks](benchmarks/matmul/README.md)

## Version 1.0.0

### Compiler

- Support `+=` syntax in conjunction with dot product (`@`) lowered to packer L1 accumulation
- Support implicit temporary compute-kernel-local DFBs
- Support `ttl.Pipenet`
- Support implicit `ttl.Block.push` and `ttl.Block.pop`
- Support implicit `ttl.Transfer.wait`
- Support for `expm1`, `exp2`, `ceil`, `sign`, `gelu`, `silu`, `hardsigmoid`, `square`, `softsign`, `signbit`, `frac`, `trunc` in `ttl.math`

### Simulator

- Support for `ttl.GroupTransfer`
- SPMD and mesh device simulation support
- Support for `ttnn.all_reduce` CCLs
- Use tracing to report statistics with `tt-lang-sim-stats`
- Remote L1 reads/writes statistics

### Examples and documentation

- Matmul tutorial

## Version 0.1.8

### Compiler

- Support for dot product operator (`@`) with lowering to [`ckernel::matmul_block`](https://docs.tenstorrent.com/tt-metal/v0.55.0/tt-metalium/tt_metal/apis/kernel_apis/compute/matmul_block.html)
- Support for fusing matmul and certain elementwise operations
- Support lowering to `pack_tile_block`
- Support for `ttl.math.fill`, `ttl.math.reduce_sum`, `ttl.math.reduce_max`, and `ttl.math.transpose`
- Support for arbitrary sub-blocking including dot product K-dimension to allow maximizing L1 usage and reuse
- Support for `sin`, `cos`, `tan`, `asin`, `acos`, `atan` in `ttl.math`
- Support for L1 sharded tensors
- Support for tensors with BF8 data type
- SPMD support (`ttnn.open_mesh_device`)

### Simulator

- Track L1 space and number of DFBs usage and warn when exceeded
- Support for tensors with row-major layout
- Support for L1 sharded tensors

### Examples and documentat
Web-based GUI for deploying and chatting with AI models on Tenstorrent hardware. Handles all technical setup automatically — deploy models, run inference, and explore capabilities through a simple browser interface.
Lightweight BMC (Baseboard Management Controller) for STM32 and similar MCUs, with Web UI, Redfish API, and HTTPS support. Built on Zephyr RTOS. Used in Tenstorrent systems.
User-mode driver for Tenstorrent hardware. The userspace layer that sits between the kernel module and higher-level SDKs.
# Changelog

## [0.9.5] - 2026-05-12

### Changed

Hardware hang detection for NOC and PCIe. Tracy profiler integration with instrumentation across TLB, PCIe and sysmem paths. DeviceProtocol ported to TTDevice, including DMA migration. SocDescriptor split into static (SocArchDescriptor) and runtime parts. LITERAL coordinate system in CoreCoord. Multicast to all TENSIX cores. SMN support. SWEmuleChip software emulation chip and Quasar simulation support (incl. 4GB TLB). Unified UmdException/UMD_ASSERT/UMD_THROW error handling across the codebase.

## [0.9.4] - 2026-03-18

### Changed

TopologyDiscoveryOptions refactoring. TopologyDiscoveryOption to retrain ETH links on 6u. TLBs for TTsim. DRAM retrain support. DeviceProtocol changes. Simulator in TTDevice changes. ETH heartbeat check.

## [0.9.3] - 2026-02-24

### Changed

Sigbus safe read write API. Remove 4U related code. Implement BH SPI as well, so full SPI support. P150 expects harvested cores. TT_VISIBLE_DEVICES uses logical IDs.

## [0.9.2] - 2026-02-09

### Changed

SPI interface for Wormhole. PCI BDF based sorting and filtering. Multicast PCI DMA. Support Blackhole loudbox. Many code fixes and test enhancements.

## [0.9.1] - 2026-01-23

### Changed

Started publishing to pypi.

## [0.9.0] - 2026-01-23

### Changed

Warm reset notification and callback implementation.

## [0.8.6] - 2026-01-20

### Changed

Make predicting ETH FW from CMFW optional in TopologyDiscovery.

## [0.8.4] - 2026-01-16

### Changed

Use older manylinux image

## [0.8.3] - 2026-01-15

### Changed

Reverted remote discovery issue

## [0.8.2] - 2026-01-15

### Changed

Support warm reset without secondary bus reset. Expose subsystem vendor id.

## [0.8.1] - 2026-01-15

### Changed

Support dma functions on TTDevice layer

## [0.8.0] - 2026-01-14

### Changed

Many functional fixes and minor changes. Final fixes needed for integration into tt-smi. Also contains adjustments needed for integration into exalens.

## [0.7.0] - 2025-11-29

### Changed

Changed to a more generic arc_msg API.

## [0.6.0] - 2025-11-24

### Changed

Change the usage of TLBs such that KMD is in control of TLB allocation instead of UMD. TLBs are now allocated using KMD's dedicated API.

## [0.5.3] - 2025-11-14

### Changed

Added generation of .deb and .rpm packages. Added three separate packages (runtime, development and python).

## [0.5.1] - 2025-11-12

### Changed

Manylinux builds and Pypi test publishing. Many smaller fixes and improvements.

## [0.4.0] - 2025-10-18

### Changed

Removed old type names.

## [0.3.0] - 2025-10-17

### Changed

Many smaller fixes and improvements. TTsim support improvements. JTAG support improvement. Fixing CMake install path. Further work on integrating new KMD TLBs.

## [0.2.0] - 2025-09-15

### Changed

A couple of smaller fixes and improvements, including L2CPU harvesting, fixes for new FW. Better TTSim support. Further JTAG support. Introduced new soft reset API. Introduced lite fabric initial version.
System firmware for Tenstorrent hardware. Low-level system initialization and control firmware that runs on-device.
Tenstorrent system interface library written in Rust. Low-level Rust bindings for communicating with and managing TT hardware.
TVM for Tenstorrent ASICs. Brings the Apache TVM compiler stack to Tenstorrent hardware, enabling model compilation from TensorFlow, PyTorch, ONNX, and more.
ISA-level simulator for the Tensix compute engine. Simulates the matrix, vector, and scalar units inside each Tensix core.
Frontend integration for PyTorch with tt-mlir. Compile PyTorch models directly to Tenstorrent hardware via torch.compile integration.
Tenstorrent firmware repository. Board management and control firmware for Tenstorrent accelerator cards.
Install the complete Tenstorrent software stack with one command. Handles drivers, firmware, Python environment, and SDK setup automatically.
Low-level hardware debugger for Tenstorrent devices. Inspect register state, memory contents, and kernel execution at the hardware level.
Configure Ethernet routing on multi-card Tenstorrent systems. Flash NB cards to use specific ETH routing configurations for scale-out deployments.
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 1.2.11 - 17/06/2025

### Updated

- Updated mesh coord generation to be connection type agnostic
- Added failure and exit if mesh type detected, but not enough connections
- Added warning in README about lack of support for BH and 6U boards

## 1.2.10 - 05/06/2025

### Updated

- Bumped tt-tools-common version to fix driver version check for compatibility with tt-kmd 2.0.0

## 1.2.9 - 30/05/2025

### Updated

- Bug fix for https://github.com/tenstorrent/tt-topology/issues/39. Now the tool will use a DFS longest path to determine a linear layout if it's not a fully connected graph.
- Updated initial device detection - now it needs full noc access for octopus and list options

## 1.2.8 - 08/05/2025

### Updated

- Fixed issue where tool would fail when PCI interfaces don't start from ID 0
- Now using actual PCI interface IDs from devices instead of assuming sequential numbering

## 1.2.7 - 07/05/2025

### Updated

- Use tools-common 1.4.15
- Use type checking in octopus reset

## 1.2.6 - 05/05/2025

### Updated

- Bug fix: added "ignore-eth" flag to first chip detect so that eth training does not loop forever and PCIe-only chips are truly detected
- Chore: bumped luwen

## 1.2.5 - 15/04/2025

### Updated

- When flashing to isolated mode, we now flash the WH ethernet ports to a disabled state, in order to prevent their use.

## 1.2.4 - 02/04/2025

### Updated

- You can now run `tt-topology -l isolated` to flash cards to the default (non-connected) state
- Users are now warned about missing or loose cables

## 1.2.3 - 21/03/2025

### Fixed

- Bumped luwen (0.6.2 -> 0.6.3) to include eth version check bug for TG setup

## 1.2.2 - 13/03/2025

### Fixed

- Bumped luwen version to make it more robust against eth fw updates

## 1.2.1 - 13/03/2025

### Fixed

- Moved the spi reads after the reset to increase stability during M3 L2R copy
- Bumped luwen version

## 1.2.0 - 06/03/2025

### Fixed

- Updated how local eth board info is calculated to make it agnostic to eth fw version
- Bumped tt-tools-common version
- Added traceback printing when catching exceptions in main.

## 1.1.5 - 14/05/2024

### Updated

- Bumped luwen (0.3.8) and tt_tools_common (1.4.3) lib versions
- Removed unused python libraries

## 1.1.4 - 25/03/2024

### Fixed

- Replaced detect_chips with detect_chips_with_callback to enable detailed debug info.

## 1.1.3 - 22/03/2024

### Fixed

- Bumped tt-tools-common version to avoid pip discrepancy.

## 1.1.2 - 22/03/2024

### Fixed

- Fixed command line bug when no args are provided.

## 1.1.1 - 21/03/2024

### Fixed

- Fixed reference to pyluwen lib

## 1.1.0 - 12/03/2024

### Added

- Octopus Configuration (4 n150s connected to 1 galaxy)

## 1.0.2 - 12/03/2024

### Fixed

- Dependency bug with tt_tools
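The 1.2.9 entry above mentions using a DFS longest path to derive a linear layout when the chips are not fully connected. The following is only a minimal sketch of that general idea, not the tool's actual implementation: the function names `build_graph` and `longest_simple_path` and the edge-list input are hypothetical, and the exhaustive search shown here is only practical for the small chip counts found in a single host.

```python
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[int, Set[int]]


def build_graph(edges: Iterable[Tuple[int, int]]) -> Graph:
    """Build an undirected adjacency map from (chip_a, chip_b) link pairs.

    Chip IDs are taken as given, so they need not start at 0 or be sequential.
    """
    graph: Graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def longest_simple_path(graph: Graph) -> List[int]:
    """Return one longest path that visits no chip twice (hypothetical helper).

    Runs a backtracking DFS from every chip and keeps the longest path seen.
    """
    best: List[int] = []

    def dfs(node: int, path: List[int], seen: Set[int]) -> None:
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        for nxt in sorted(graph[node]):
            if nxt not in seen:
                seen.add(nxt)
                path.append(nxt)
                dfs(nxt, path, seen)
                # Backtrack so other branches can reuse this chip.
                path.pop()
                seen.remove(nxt)

    for start in sorted(graph):
        dfs(start, [start], {start})
    return best


# Example: a chain 0-1-2-3 with an extra chip 4 hanging off chip 1.
links = [(0, 1), (1, 2), (2, 3), (1, 4)]
layout = longest_simple_path(build_graph(links))
print(layout)
```

In this example the graph is not fully connected, so no layout can include every chip in one line; the search settles for the longest chain it can find and leaves the extra chip out.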
Network-on-chip Performance Estimator for Tenstorrent Tensix-based devices. Model and estimate NoC utilization before running kernels on hardware.
Optimized training recipes for a variety of ML models on Tenstorrent hardware, powered by the TT-Forge compiler stack. Reference implementations for fine-tuning and training from scratch.
End-to-end AI applications running on Tenstorrent AI accelerators. Complete application examples from retrieval-augmented generation to image generation pipelines.
Tenstorrent firmware update utility. Flash new firmware onto Tenstorrent accelerator cards from the command line.
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## 3.4.0 - 30/07/25

- Bump pyyaml 6.0.1 -> 6.0.2
- Improve error message formatting
- No longer have to use --force for flashing BH cards

## 3.3.5 - 03/07/25

- Bump luwen 0.7.3 -> 0.7.5

## 3.3.4 - 02/07/25

- Bump tt-tools-common 1.4.16 -> 1.4.17
- Bump luwen 0.6.4 -> 0.7.3

## 3.3.3 - 05/06/2025

- Bumped tt-tools-common version to fix driver version check for compatibility with tt-kmd 2.0.0

## 3.3.2 - 14/05/2025

- Bump tt-tools-common version to latest

## 3.2.0 - 12/03/2025

### Updated

- luwen version bump to bring in line with tt-smi; provides stability fixes

## 3.1.3 - 06/03/2025

### Added

- luwen version bump to include bh arc init checks

## 3.1.2 - 28/02/2025

### Added

- Support for more BH cards: p100a, p150, and p150c

## 3.1.1 - 06/01/2025

### Updated

- Bumped luwen version to accommodate Maturin updates

## 3.1.0 - 29/10/2024

### Added

- Support for flashing the BH tt-boot-fs file format
- Bumped luwen version to 0.4.6 to allow resets when chip is inaccessible

## 3.0.2 - 17/10/2024

### Fixed

- Unbound variable when exception is thrown when getting current fw-version

## 3.0.1 - 16/10/2024

### Changed

- Bumped luwen version to 0.4.5 to resolve false positives on bad chip detection

## 3.0.0 - 23/08/2024

- NO BREAKING CHANGES! Major version bump to signify new generation of product.
- Added support for p100

## 2.2.0 - 19/07/2024

### Updated

- Added support for an alternative spi flash configuration via a new version of luwen

## 2.0.8 - 14/05/2024

### Updated

- Bumped luwen (0.3.8) and tt_tools_common (1.4.3) lib versions

## 2.0.1 - 2.0.7

- Dependency updates

## 2.0.0

- WH flash release

## 1.0.0

- GS flash release
48 interactive lessons covering the full Tenstorrent developer path — from hardware detection to custom training — with click-to-run commands and hardware auto-detection. Available in VSCode and code-server.
# Changelog

All notable changes to the TT-VSCode-Toolkit will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

---

## [0.0.503] - 2026-06-12

### Fixed

- **Self-review fixes** - 29 issues confirmed by automated adversarial review: corrected MESH_DEVICE enum values in code-fence comments (T3000→T3K, n150→N150, Galaxy→GALAXY throughout vllm-production, image-generation, bounty-program, version-compatibility, step-zero, README); removed `<sup>™/</sup>` HTML injected into yaml/bash code fences (ct3-configuration-patterns, step-zero); fixed api-server print string back to `tt-metal ready`; corrected `p100`→`P100` in hardware-detection and QB_follows prose; completed T3K→T3000 normalization in ttsim/cookbook-particle-life prose callouts; fixed FAQ duplicate stale QB2 paragraph and TTNN/tt-metal prose table cells; fixed bare `tt-metal` prose in image-generation (→ TT-Metalium); updated link display text in cookbook-overview and tt-inference-server (github URL→ product name); clarified Vale config comments (ProductNames.yml T3000 exception, Terminology.yml link-text caveat).

## [0.0.502] - 2026-06-11

### Fixed

- **QB2 → TT-QuietBox 2 in llms.txt** - the LLM context file (consumed by the content website) had 11 prose `QB2` references; all replaced with `TT-QuietBox 2`; URL slugs (`qb2-*`) left untouched.

## [0.0.501] - 2026-06-11

### Fixed

- **QB2 → TT-QuietBox 2 prose normalization** - replaced all `QB2` shorthand in prose with the full `TT-QuietBox 2` product name across `ttsim-twenty-and-ten.md`, `cookbook-particle-life.md`, and `FAQ.md`; lesson title slugs (`qb2-*`) and command IDs left untouched.

## [0.0.500] - 2026-06-11

### Changed

- **Version bump** - increment to 0.0.500 after merging copyedit branch with origin/main; consolidates copyedit normalization (hardware IDs, TT-Metalium™/TT-NN™ trademarks, TT-QuietBox naming) with main's ttsim, AnimateDiff Phase 2.5, and mobile improvements.

---

## [0.0.477] - 2026-05-27

### Changed

- **Prose copyedit pass** - fixed TT-Forge<sup>™</sup> trademark placement in `tt-xla-jax.md`; updated `STYLE_GUIDE.md` hardware casing rules (`n150`/`n300`/`T3000`/`p300c`, capitalized `Galaxy`); normalized hardware IDs and `TTNN`→`TT-NN` in prose and sample output; renamed `TT Metal`→`TT-Metalium` in `tt-inference-server.md`. Extended `normalize-hardware-copy.js` and `normalize-ttnn-copy.js`; added `normalize-tt-metal-copy.js`. Polished `STYLE_GUIDE.md` trademark examples; fixed `normalize-open-source-copy.js` to skip inline code; added `plans/vscode-toolkit-copyedit-pr.md` PR summary.

---

## [0.0.476] - 2026-05-27

### Changed

- **TT-Metalium<sup>™</sup> and TT-NN<sup>™</sup> trademarks** - first prose mention per page now uses `TT-Metalium` and `TT-NN` (trademark, not registered). Updated `scripts/add-tt-product-trademarks.js` and `STYLE_GUIDE.md`; migrated pri
A vibrant htop-style visualizer for Tenstorrent hardware written in Rust. Real-time process and utilization view for TT accelerators.
Generate infinite videos and images (and imaginative prompts to inspire them) on Tenstorrent's TT-QuietBox 2. Fully local generative media pipeline.
Generates short, temporally coherent animated GIFs using the AnimateDiff model on Tenstorrent hardware. Phase 1 runs the correct SD 1.4 + MotionAdapter architecture on CPU; Phase 2 accelerates spatial denoising on Blackhole using the TTNN UNet. Produces vibrant 8-frame animations in ~15 s/frame on a P300C.