New release: ttsim v1.7.3
New release: tt-vscode-toolkit v0.0.454
New release: tt-animatediff v0.1.0
New release: dstack 0.20.23
New release: ttnn-visualizer v0.88.0
New release: tt-toplike v0.6.1
New release: tt-flash v3.8.0
New release: tt-system-firmware v19.10.0
New release: tt-inference-server v0.15.0
New release: BarraCUDA v0.5.0
New release: ttas v0.1.0
New release: tt-local-generator v0.3.4
New release: TT-Studio v2.6.0
New release: tt-smi v5.2.0
Kernels written by and for tt-lang are loaded to Blackhole chips on the Tenstorrent Quietbox 2, powering all forms of "AI" in the open source Civilization clone, FreeCiv. Look at all those fish!
A 6,500-word community deep dive into the Blackhole p100a architecture: the tile model (Tensix, DRAM, SiFive x280 L2CPU, Ethernet, PCIe, NoC arc), firmware startup sequence, MOP micro-op processor, replay buffer, FPU/SFPU sync, and the anatomy of a kernel. From the author of blackhole-py.
Lecture 20 from William & Mary's graduate Computer Architecture course. Frames Tenstorrent in the landscape between GPUs and TPUs, draws comparisons to Cerebras and SambaNova, then dives deep into the Wormhole chip and Tensix core: the 5 RISC-V core design, SFPU, NoC, and dataflow execution model.
A fused kernel for the Grayskull architecture implementing Transformer self-attention entirely within SRAM. Combines matrix multiply, attention score scaling, and Softmax without DRAM accesses, achieving significant speedups over non-fused implementations.
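For orientation, the three stages the blurb names (matrix multiply, score scaling, Softmax) can be written as a plain reference computation. This is only a sketch of the math a fused kernel would have to reproduce, not the Grayskull kernel itself: it models neither SRAM residency nor tiling, and the function names are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Row-by-column product of two matrices stored as nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d)) V for small nested-list matrices."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose K
    scores = matmul(q, k_t)                        # stage 1: matrix multiply
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]  # stage 2: scaling
    weights = [softmax(row) for row in scaled]     # stage 3: Softmax
    return matmul(weights, v)
```

A non-fused implementation would write `scores`, `scaled`, and `weights` back to memory between stages; fusing keeps those intermediates local.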
Evaluates the Tenstorrent Grayskull e75 RISC-V accelerator for matrix multiplication at reduced numerical precision (BFP8 and LoFi), a fundamental kernel in LLM inference computation.
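To make the reduced-precision idea concrete, here is a minimal sketch of block floating point: a group of values shares one exponent and each value keeps only a short signed mantissa. The 7-bit mantissa width and the helper names are assumptions made for illustration; the exact BFP8 layout on the hardware is not described in the blurb.

```python
import math

def bfp_quantize(block, mantissa_bits=7):
    """Quantize floats to one shared exponent plus signed integer mantissas."""
    peak = max(abs(x) for x in block)
    if peak == 0.0:
        return 0, [0] * len(block)
    # frexp gives peak = m * 2**shared_exp with 0.5 <= m < 1
    _, shared_exp = math.frexp(peak)
    scale = 2.0 ** (shared_exp - mantissa_bits)
    limit = (1 << mantissa_bits) - 1
    # Round to the nearest step and clamp to the representable range
    mantissas = [max(-limit, min(limit, round(x / scale))) for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits=7):
    """Rebuild approximate floats from the shared exponent and mantissas."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]
```

Values much smaller than the block's largest magnitude lose most of their bits (they can round to zero), which is why accuracy depends on how values are grouped.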
Evaluates three strategies for scaling an N-body code across multiple Tenstorrent Wormhole accelerators. Builds on the established performance of single-card N-body work to explore parallelism via the on-chip NoC and multi-accelerator configurations.
Compiler system that automatically generates efficient dataflow plans for tile-based languages on spatial accelerators including Tenstorrent Wormhole. Exploits on-chip network forwarding between processing elements to reduce DRAM pressure.
Shows that Text-to-Speech inference on Tenstorrent Lightning V2 achieves 4× lower cost than NVIDIA L40S. Applies BlockFloat8 (BFP8) and low-fidelity (LoFi) precision strategies to TTS despite their greater numerical fragility compared to LLMs.
Tenstorrent Low-Level Kernels: the C++ library that directly programs the RISC-V cores inside each Tensix compute engine. TRISC0 (unpack), TRISC1 (math/FPU/SFPU), and TRISC2 (pack) are all programmed through this layer — it is the interface between TT-Metal kernel code and bare silicon.
Maps 2D 5-point stencil computations onto the Tenstorrent Wormhole RISC-V AI dataflow accelerator via two implementations: element-wise decomposition (Axpy) and matrix-multiplication reformulation (MatMul). Profiling shows the isolated Wormhole kernel is competitive with CPU execution, with PCIe transfers and initialization driving end-to-end overhead; Axpy achieves lower energy than the CPU baseline at large scales. Identifies architectural and software directions for making AI accelerators viable for HPC stencil workloads. 2025.
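As a rough illustration of the element-wise (Axpy) decomposition, the sketch below computes a 5-point stencil two ways: directly, and by accumulating four shifted operands with repeated axpy calls. The uniform weight of 0.25 (a Jacobi-style average) and all function names are assumptions; the paper's actual coefficients and kernel structure may differ.

```python
def jacobi_step(grid):
    """One 5-point stencil sweep; boundary cells are copied unchanged."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new

def axpy(a, x, y):
    """Element-wise a*x + y over two equal-length lists."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def stencil_row_axpy(up, mid, down, w=0.25):
    """Interior of one output row built from four shifted operands."""
    acc = [0.0] * (len(mid) - 2)
    # Operands: north, south, west, and east neighbors of each interior cell
    for operand in (up[1:-1], down[1:-1], mid[:-2], mid[2:]):
        acc = axpy(w, operand, acc)
    return acc
```

The axpy form turns the stencil into a short sequence of element-wise operations over shifted views of the grid, which is the shape of work an element-wise decomposition relies on.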
New release: whisper 1.861
New release: tt-sim v1.0
Boltz-2 biomolecular model for drug discovery on Tenstorrent Blackhole. Supports single-card and multi-card configurations — QuietBox (4×) and Galaxy (32×). Approaches physics-based FEP accuracy at 1000× the speed.
Deep-dive into the Tenstorrent architecture and Metalium programming model — circular buffers, kernel synchronization, NoC routing, and where the footguns are. The honest guide to thinking in Tensix.
Sponsored series of deep technical articles on implementing optimal SFPU kernels for the Tenstorrent Wormhole and Blackhole vector units. Covers where, typecasting, 16/32-bit integer multiplication, cube root, and accurate sin/cos/tan — with cycle counts, assembly walkthroughs, and Blackhole vs Wormhole comparisons throughout.
Step-by-step guide to getting a Tenstorrent card running on Arch Linux with the full Metalium stack. Practical troubleshooting from someone who did it the hard way first.
Honest field notes from getting a Grayskull card running and writing first Metalium kernels. Covers setup pitfalls, processor hangs, memory protection quirks, and what makes Metalium compelling despite early rough edges.
Ports the Cooley-Tukey FFT algorithm to the Wormhole n300 RISC-V accelerator. The Wormhole draws 8× less power and consumes 2.8× less energy than a 24-core Xeon Platinum for a 2D FFT. ISC 2025.
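For readers unfamiliar with the algorithm, a textbook radix-2 Cooley-Tukey FFT looks like the sketch below, with a 2D transform built from row passes followed by column passes. This is a plain reference version, not the paper's Wormhole port; the row-then-column 2D scheme is an assumption about how a 2D FFT is typically composed.

```python
import cmath

def fft(x):
    """Recursive radix-2 decimation-in-time FFT; length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])   # transform of even-indexed samples
    odd = fft(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft2(rows):
    """2D FFT: transform every row, then every column."""
    by_row = [fft(r) for r in rows]
    by_col = [fft(list(c)) for c in zip(*by_row)]
    return [list(r) for r in zip(*by_col)]
```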
Accelerates an astrophysical N-body simulation on the Wormhole n300. Achieves 2× speedup and 2× energy savings over a highly optimized CPU implementation. SC '25 Workshop.
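The core of a direct-sum N-body code is the all-pairs force loop, sketched below for gravitational acceleration. The softening term `eps`, the unit gravitational constant, and the function name are assumptions for illustration; the paper's kernel and integrator are not described in the blurb.

```python
import math

def accelerations(pos, mass, g=1.0, eps=1e-3):
    """All-pairs gravitational acceleration on each body (O(n^2) direct sum)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Vector from body i to body j
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            # Softened squared distance avoids a singularity at close range
            r2 = sum(c * c for c in d) + eps * eps
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += g * mass[j] * d[k] * inv_r3
    return acc
```

Every body's update is independent of the others within a step, which is what makes this loop a natural fit for many parallel cores.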
Implements three numerical kernels and composes them into a conjugate gradient solver on Wormhole. Demonstrates AI accelerators merit consideration for HPC workloads traditionally dominated by CPUs and GPUs. 2026.
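A conjugate gradient solver can be composed from a few small kernels, as the sketch below shows. The blurb does not name the paper's three kernels; dot product, axpy, and matrix-vector product are assumed here because they are the usual building blocks, and all names are illustrative.

```python
def dot(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def axpy(a, x, y):
    """Element-wise a*x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def matvec(matrix, x):
    """Dense matrix-vector product."""
    return [dot(row, x) for row in matrix]

def conjugate_gradient(matrix, b, tol=1e-10, max_iter=100):
    """Solve matrix @ x = b for a symmetric positive-definite matrix."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A x, with x = 0
    p = list(r)            # initial search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = matvec(matrix, p)
        alpha = rs / dot(p, ap)
        x = axpy(alpha, p, x)        # x <- x + alpha p
        r = axpy(-alpha, ap, r)      # r <- r - alpha A p
        rs_new = dot(r, r)
        p = axpy(rs_new / rs, p, r)  # p <- r + beta p
        rs = rs_new
    return x
```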
Explores stencil computation on the Grayskull PCIe RISC-V accelerator. Early academic work examining TT hardware for HPC stencil workloads. 2024.
Makes multi-tenant NPU sharing practical for Blackhole-class hardware using polynomial-time allocation algorithms. Delivers up to 1.37× higher utilization and 1.14× faster workload completion. Up to 890,000× faster than NP-hard baselines.
Three agentic projects running fully on-device: local AI agents on QuietBox 2, a coding assistant powered by Aider against a local inference server, and the OpenClaw AI assistant on QuietBox 2. No cloud APIs — all inference runs on TT hardware.
Three lesson-projects covering on-device video synthesis: frame-by-frame diffusion with tt-local-generator, native AnimateDiff video animation, and video generation on QuietBox 2. All run entirely on TT hardware with no cloud dependency.
Particle Life simulation on Tenstorrent hardware — an emergent-behavior N-body system where simple attraction/repulsion rules between species produce complex lifelike patterns. Cookbook recipe demonstrating parallel N-body compute on Tensix.
Seven-module computer science curriculum taught on real Tenstorrent hardware. Covers RISC-V architecture, memory hierarchy, parallel computing, networks and NoC, synchronization, abstraction layers, and computational complexity — all grounded in what is physically happening on the chip.
A Tenstorrent-powered claw machine that rewards players with real prizes. The QuietBox 2 runs local AI inference to act as an agent controlling the claw hardware — the OpenClaw AI assistant lesson builds directly on this project.
On-device image generation with Stable Diffusion XL running entirely on Tenstorrent hardware. Full inference pipeline with no cloud dependency.
End-to-end image classification project using TT-Forge — compile and run a PyTorch classification model on Tenstorrent hardware with no kernel authoring required.
Interactive browser-based visualizer of the Tenstorrent Tensix grid architecture. Explore the NoC, core layout, and dataflow patterns without hardware — a great companion for learning kernel programming.
TT-Metalium implementation of Conway's Game of Life as a cookbook recipe. Each generation is a full parallel kernel dispatch over the grid — a clean introduction to stateful compute on Tensix cores.
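The per-generation update that each dispatch would perform can be sketched on the host as follows. This is a plain reference of the Game of Life rules, not the recipe's TT-Metalium kernel; the wrap-around (toroidal) boundary is an assumption.

```python
def life_step(grid):
    """Advance a 0/1 grid by one generation with wrap-around edges."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(i, j):
        # Count the eight surrounding cells, wrapping at the edges
        return sum(
            grid[(i + di) % rows][(j + dj) % cols]
            for di in (-1, 0, 1)
            for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)
        )

    new = []
    for i in range(rows):
        row = []
        for j in range(cols):
            n = live_neighbors(i, j)
            # Birth on exactly 3 neighbors; survival on 2 or 3
            alive = n == 3 or (grid[i][j] == 1 and n == 2)
            row.append(1 if alive else 0)
        new.append(row)
    return new
```

Each cell reads only the previous generation, so the whole grid can be updated in parallel; the state carried between dispatches is the grid itself.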
Eight-lesson series covering the full custom training workflow on TT hardware: dataset fundamentals, configuration patterns, fine-tuning, multi-device distributed training, experiment tracking, model architecture basics, and training from scratch.
Three hands-on TT-Metalium kernel recipes: a Mandelbrot fractal explorer, real-time audio signal processing pipeline, and custom image filter stack. Each recipe is a complete kernel project with full source in the lesson.
PJRT device plugin for Tenstorrent hardware. Enables JAX, PyTorch/XLA, and other XLA-based frameworks to target TT accelerators.
Production-ready model serving for Tenstorrent hardware with OpenAI-compatible REST API. Supports continuous batching, multiple models, and all TT hardware configurations.
Python-based DSL that sits between TT-NN and TT-Metalium — expresses custom fused kernels with progressive disclosure, compiling directly to Tensix. Ships an integrated functional simulator (no hardware needed), line-by-line performance metrics, and AI-agent-friendly tooling. Two packages: tt-lang (compiler + hardware, requires ttnn) and tt-lang-sim (simulator only, works on Linux/macOS without Tenstorrent hardware).
Install the complete Tenstorrent software stack with one command. Handles drivers, firmware, Python environment, and SDK setup automatically.
48 interactive lessons covering the full Tenstorrent developer path — from hardware detection to custom training — with click-to-run commands and hardware auto-detection. Available in VSCode and code-server.
TT-Deploy, May 1st, 2026 – Customers running Tenstorrent in production — David Bennett (AI&), Alex Nataros (Cirrascale), Sanchayan Sinha (Turiyam), and Mike Gorbinski (Virtu Financial) — share why they chose our hardware and what they're building on it.
TT-Deploy, May 1st, 2026 – Jim Keller reads our love letter to DeepSeek v4, reviews how Tenstorrent achieves unlimited scale, and shows off our video gen benchmark on Artificial Analysis. He closes with the line "Tenstorrent is where AI runs."
TT-Deploy, May 1st, 2026 – Stan Sokorac, Sr. Fellow, Software, reveals the results of an agentic pipeline that continuously tests random Hugging Face models on Tenstorrent hardware: a 90% pass rate, projecting to roughly 2.5 million models. Plus a deep dive on TT-Lang, TT-Forge, and the only 100% open-source high-performance AI software stack.
TT-Deploy, May 1st, 2026 – Tenstorrent's Amr Elashmawi sits down with Justen Aguillon (Equinix), Sheng Yeo (OrionVM), and Abhishek Bhargava (BetterBrain) to walk through the Equinix Distributed AI Hub — a full-stack sovereign agentic AI platform now live in Ashburn.
TT-Deploy, May 1st, 2026 – Jasmina Vasilović, Senior Fellow of ML Frameworks & Programming Models, walks through the Tenstorrent software stack and how Blackhole's switch-free architecture extends the low-cost serving curve where GPU economics collapse — plus a partner spotlight on Prodia's faster-than-real-time WAN 2.2 video generation.
TT-Deploy, May 1st, 2026 – Jim Keller walks through the fundamentals – scale, general-purpose, and lower cost – that enable Tenstorrent to be built for the constantly changing landscape of AI.
Join as we unveil Tenstorrent’s AI solutions deployed at scale. See the full breadth of what we've built — validated by real architecture, benchmarks, and customer deployments.
All these videos were generated at home on Tenstorrent TT-QuietBox 2 using WAN 2.2 and SkyReels.
Qwen3-32B is used in three interconnected agent demos. Learn how on your own: https://docs.tenstorrent.com/tt-vscode-toolkit/lessons/qb2-local-agents/ The visualizer is tt-toplike: https://docs.tenstorrent.com/tt-toplike Time is condensed in multiple segments.
Each chip compiles independently of the others. The footage is time-compressed to fit within a minute and covers about 14 minutes of compilation. Find the tt-forge-compiletron here: https://github.com/tsingletaryTT/tt-forge-compiletron
Qwen3-32B is used in three interconnected agent demos. Learn how on your own: https://docs.tenstorrent.com/tt-vscode-toolkit/lessons/qb2-local-agents/ The visualizer is tt-toplike: https://docs.tenstorrent.com/tt-toplike
New release: tt-bh-linux v0.11
New release: tt-kmd ttkmd-2.8.0
New release: luwen v0.8.5
New release: tt-installer v2.2.1
New release: tt-topology v1.2.19
New release: tt-firmware v19.6.0
New release: nvtop 3.3.2
New release: RiESCUE v1.7.0
Thaddeus Fortenberry, VP of Robotics + Automotive at Tenstorrent, discusses how we're paving the open road to chiplet interoperability in robotics. Open Chiplet Atlas | https://www.openchipletatlas.org Tenstorrent Robotics IP | https://tenstorrent.com/ip X https://twitter.com/tenstorrent Discord https://discord.com/invite/tenstorrent
Watch Jim Keller and Miles Dooley kick off Tenstorrent's IP launch event in San Francisco, TT-Blueprint. Tenstorrent IP | https://tenstorrent.com/ip RISC-V CPU | https://tenstorrent.com/ip/risc-v-cpu Open Chiplet Atlas | https://www.openchipletatlas.org/ X https://twitter.com/tenstorrent Discord https://discord.com/invite/tenstorrent
Wei-Han Lien, Chief Architect at Tenstorrent, discusses how Tenstorrent is pushing the chiplet ecosystem forward with Open Chiplet Atlas. Open Chiplet Atlas | https://www.openchipletatlas.org/ Tenstorrent IP | https://tenstorrent.com/ip RISC-V CPU | https://tenstorrent.com/ip/risc-v-cpu X https://twitter.com/tenstorrent Discord https://discord.com/invite/tenstorrent
Aniket Saha, VP of Product Strategy at Tenstorrent, discusses Tenstorrent's new, open business model for high-performance IP and why the old, closed-ecosystem approach isn't working. Tenstorrent IP | https://tenstorrent.com/ip X https://twitter.com/tenstorrent Discord https://discord.com/invite/tenstorrent
This session, led by Tapasvi Patel, Sr. Engineer at Tenstorrent, covers how Tenstorrent's compiler (TT-Forge) and runtime (TT-NN) manage device meshes, tensor sharding, and collective communications, with a focus on using JAX. Also, see how Shardy helps with automatic parallelization. 0:00 Introduction 0:25 Modeling multi-device in Tenstorrent's compiler & runtime 5:33 Parallelism strategies and multi-chip hardware 8:38 Core TT-NN runtime features for multi-device (meshes, tensors, mappers)…
Heterogeneous GPU infrastructures present a binary compatibility challenge: code compiled for one vendor's GPU will not run on another due to divergent instruction sets, execution models, and driver stacks. We propose hetGPU, a new system comprising a compiler, runtime, and abstraction layer that together enable a single GPU binary to execute on NVIDIA, AMD, Intel, and Tenstorrent hardware. The hetGPU compiler emits an architecture-agnostic GPU intermediate representation (IR) and inserts…
With the rapid development of artificial intelligence (AI) applications, an emerging class of AI accelerators, termed Inter-core Connected Neural Processing Units (NPU), has been adopted in both cloud and edge computing environments, like Graphcore IPU, Tenstorrent, etc. Despite their innovative design, these NPUs often demand substantial hardware resources, leading to suboptimal resource utilization due to the imbalance of hardware requirements across various tasks. To address this issue,…
TT-Forge is Tenstorrent’s MLIR-based compiler. This Q&A covers its architecture, front-end support (Torch, JAX), tools like TT-Explorer, and key discussions on runtime customization, debugging, production timelines, quantization, model training, the PiKernel DSL, and how to contribute. 0:00 What is TT-Forge? 1:09 Accessing Tenstorrent hardware on Koyeb 4:39 Compiler production timeline and manual model implementation 7:24 TT-Explorer for quantization schemes 8:26 Model training framework with…
Tenstorrent Developer Day Livestream April 3rd at 10 am PT
TT-Forge is Tenstorrent’s MLIR-based compiler. Learn how TT-Forge integrates with our AI software stack, why we’re building on MLIR, and the features that make TT-Forge flexible and adaptable. 0:00 Introduction 1:36 Integration with key ML frameworks 2:05 Why MLIR? 3:10 Driving principles of TT-Forge 7:48 Recap tt-forge: https://github.com/tenstorrent/tt-forge tt-mlir: https://github.com/tenstorrent/tt-mlir Follow Tenstorrent on X at https://x.com/tenstorrent Join our Discord at…
New release: tt-buda v0.19.3