The installer now defaults to Docker as its container runtime instead of the previous option, streamlining setup for most users while maintaining flexibility for those with different requirements. This change simplifies the initial configuration experience and aligns with Docker's widespread adoption in development environments.
This patch fixes an idempotency issue in the autoInit method for single-chip setups, ensuring that repeated initialization calls don't cause unintended side effects. If you're using TensixViz for automated visualization workflows on single-chip systems, this prevents state inconsistencies when initialization gets triggered multiple times—a common scenario in development and testing environments.
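As a rough illustration of what an idempotent initializer looks like, the sketch below guards one-time setup behind a flag. The `Viewer` class and its `auto_init` method are hypothetical stand-ins written in Python for brevity; they are not the TensixViz API or its actual fix.

```python
class Viewer:
    """Hypothetical stand-in for a visualization player (not the TensixViz API)."""

    def __init__(self):
        self.initialized = False
        self.setup_calls = 0

    def _setup(self):
        # One-time work, e.g. building the single-chip layout.
        self.setup_calls += 1

    def auto_init(self):
        # Guard: a repeated call returns early instead of redoing setup.
        if self.initialized:
            return
        self._setup()
        self.initialized = True


viewer = Viewer()
for _ in range(3):
    viewer.auto_init()
```

Calling `auto_init` three times runs `_setup` once, which is the property the patch is described as restoring.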
This release brings steady progress on model coverage and infrastructure improvements, with updated dependencies across tt-xla, tt-mlir, and tt-metal. The team has added several new model demos—including Llama, Tiny Llama, and Qwen3—alongside expanded ecosystem skills for AI workflows. Behind the scenes, there's been a focus on testing and CI/CD refinement: installation validation now uses AI-driven testing, the benchmark infrastructure has been removed to streamline the codebase, and AI bringup scripts have migrated to a dedicated repository to keep concerns separated. The performance tables show broad support for LLM inference across architectures like Falcon, Llama, Qwen, and Mistral at various model sizes, plus solid coverage for vision and multi-modal tasks on both n150 and p150 hardware.
This release brings substantial improvements to the vLLM plugin and infrastructure, alongside broader model coverage and performance tuning across the ecosystem. Key enhancements include KV cache dtype defaults (now bfp_bf8), decode graph optimization (reduced from 5 to 2), and skip mechanisms for wasteful profile runs that improve both compile time and throughput. Multihost support landed with deferred tensor transfers, vLLM warmup phase optimization, and better diagnostics through logging. The team also added support for new models like Mixtral and Pixtral in the vLLM plugin, expanded multimodal capabilities (Rotary Embedding with multimodal sections), and brought component tests online for major diffusion models including Mochi-1, HiDream, Playground v2.5, and various video generation pipelines. Stability work continued in parallel, with better error handling, RMSNorm fusion patterns, and refined test categorization across n150, n300, and p150 hardware.
This release brings synchronized updates across tt-mlir and tt_forge_models dependencies, along with test infrastructure improvements and a fix for ResNet50 benchmark regressions that addresses device teardown segfaults and layout configuration issues. The team also refactored ONNX test suites and corrected a verify config bug that was writing to the wrong global variable. You can grab the update via PyPI or Docker, and the changes span from late May through late June.
See how Blackhole's switch-free architecture extends the low-cost serving curve by unifying compute, memory, and networking into a single scalable system. Presented by Jasmina Vasilović, Senior Fellow, ML, Compilers and Models at Tenstorrent. Learn more about Tenstorrent Galaxy™ Blackhole: https://tenstorrent.com/hardware/galaxy
The simulator now covers more hardware configurations and instruction variants across Tenstorrent architectures, including pack format modes, ReLU intermediate formats, and expanded ADC/RWC support on Wormhole and Blackhole devices. Recent additions handle debug bus access to multiple subsystems, Tensix replay buffer wrapping, and 64-bit atomics in QSR DRAM cores, changes that help developers validate their workloads more accurately against actual silicon behavior. The release also works around a known Wormhole/Blackhole erratum and improves error reporting for edge cases, making the simulator a more reliable reference for hardware-software codesign.
The inference server now runs a broader set of models on Blackhole QuietBox 2 hardware, with support added or uplifted for large language models (gpt-oss-120b, Llama-3.1-8B-Instruct, Qwen3-32B) as well as multimodal and generative workloads (Z-Image-Turbo, FLUX.1-dev, Wan2.2-T2V-A14B). Wormhole Galaxy systems also gain uplifted implementations of Whisper speech recognition models. The update brings the stack to a consistent TT-Metal commit and includes revised software version recommendations for Galaxy deployments, giving developers a stable baseline for production inference across the hardware lineup.
The installer now bundles the setup script directly into the main installation script at build time, eliminating an external dependency and simplifying the installation flow. This consolidation makes the installer more self-contained and reduces moving parts during the setup process.
The installer now includes a formalized schema (v1) that standardizes how installation configurations are defined and validated, along with comprehensive tests to ensure reliability. This structured approach should make it easier to maintain consistent installations across different environments and lay groundwork for future tooling that can work with installation metadata programmatically.
The player schema handling has been fixed to ensure that autoInit operations are truly idempotent, preventing unexpected state issues when initialization runs multiple times. This is particularly useful if you're working with visualization tooling that might trigger initialization in unpredictable patterns, as you can now rely on consistent behavior regardless of how many times the setup runs.
tt-umd v0.9.7 adds several new error types for firmware diagnostics (heartbeat, routing config, and mismatch detection) and exposes DeviceTimeoutError to Python so applications can catch and handle per-operation MMIO timeouts gracefully. The release refactors device construction and protocol layers to push more logic into TTDevice for consistency across backends, fixes coordinate translation for harvested Ethernet cores on Blackhole, and introduces SimulationSocket for per-chip simulation sockets and OpTimeoutGuard as a reusable timing primitive. System packages are available for Ubuntu 22.04/24.04 and Fedora 39, alongside Python wheels for 3.9–3.13 across x86_64 and aarch64 architectures.
This release focuses on infrastructure improvements, architectural support expansion, and numerous bug fixes across the tt-metal stack. Key highlights include bringing up Quasar architecture support with its SFPU compute APIs, migrating dataflow kernels to the Device 2.0 NoC API, and stabilizing DeepSeek prefill operations with optimizations for multi-chip deployments. The team also made significant progress on descriptor framework migrations for better program caching and resolved critical issues in sampling, MoE operations, and layer normalization that were causing device hangs and accuracy regressions. This is primarily a foundation-focused release with substantial internal refactoring to enable future feature work.
dstack now reuses SSH connections to instances through a server-side pool, cutting the per-operation overhead that previously slowed down runs, dev environments, and services — the improvement is especially noticeable on servers managing many instances. The release also speeds up the runs listing endpoint, making dstack ps and the UI snappier for projects with large run histories, and fixes AWS Capacity Reservation support by properly applying the reserved tenancy when launching instances.
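The idea of reusing connections through a pool can be sketched in a few lines. This is a minimal single-threaded illustration, not dstack's implementation: `ConnectionPool` and `fake_connect` are hypothetical names, and a real pool would also need locking, health checks, and eviction.

```python
class ConnectionPool:
    """Keeps one open connection per host and hands it back on later requests."""

    def __init__(self, connect):
        self._connect = connect   # callable that opens a new connection
        self._connections = {}    # host -> open connection

    def get(self, host):
        conn = self._connections.get(host)
        if conn is None:
            # Only the first request for a host pays the connection cost.
            conn = self._connect(host)
            self._connections[host] = conn
        return conn


opened = []


def fake_connect(host):
    # Stand-in for an SSH handshake; records every time one is performed.
    opened.append(host)
    return f"conn:{host}"


pool = ConnectionPool(fake_connect)
for _ in range(3):
    pool.get("instance-a")
pool.get("instance-b")
```

Four requests result in two handshakes, one per instance, which is where the per-operation saving comes from.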
The Buffer Summary timeline now ranks ops by kernel duration, op-to-op gaps, DRAM utilization, FLOPS, or L1 fullness with color-coded badges, making it easier to spot bottlenecks at a glance. The visualizer also cleans up the MLIR upload flow with clearer progress messaging, fixes a multihost bug that was silencing non-host ops from the graph capture, and improves UI density by letting you toggle globally allocated circular buffers out of view when they clutter the per-core grid—plus several layout refinements to the cluster view and legend alignment that reduce visual overlap.
This release brings the VS Code Toolkit's lesson and cookbook content fully in line with Ubuntu 24.04 and QB2 environments by fixing Python interpreter references, venv activation paths across three deployment contexts (tt-developer-image, QB2 pre-installed, and cloud setups), and device management patterns that were triggering dispatch errors on multi-chip teardown. The updates touch seven major lessons—from game-of-life basics to vllm production deployment—ensuring developers can follow along without path or interpreter mismatches regardless of their setup. A particularly important fix replaces the broken per-chip device opening loop in cookbook-particle-life with coordinated CreateDevices/CloseDevices calls, eliminating a common source of teardown crashes on multi-device systems.
The v0.6.3 release brings performance improvements to the Defrag operation while sharpening reactivity features, making the tool more responsive when you're monitoring and managing your Tenstorrent systems. These optimizations build on the foundation laid in previous releases, ensuring smoother interactions as you work with larger or more complex device configurations.
The ttsim-QEMU bridge tooling gets steadier here. The install check no longer shells out to grep, using a pure-Node substring match instead, so it works regardless of whether grep is on PATH. The free-disk-space check now measures the actual ttsim-qemu target directory rather than your home folder — which matters when ~/sim lives on its own partition — and the threshold moves up to 12 GB to fit the 10 GB image. The qemu-bridge lesson was also reworded to make clear that the frictionless “pip install ttnn and go” flow is the intended end state, not where things stand today.
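Both checks are easy to picture in code. The sketch below is written in Python rather than the tool's Node, and every name in it is an assumption made for illustration; it also assumes 1 GB means 1024³ bytes, which the release note does not specify.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 12  # threshold named in the release note


def free_gb(target: Path) -> float:
    # Walk up to the nearest existing ancestor so the check also works
    # before the target directory has been created.
    probe = target.expanduser().absolute()
    while not probe.exists():
        probe = probe.parent
    # disk_usage reports on the filesystem holding `probe`, so a target on
    # its own partition is measured instead of the home folder.
    return shutil.disk_usage(probe).free / 1024**3


def has_enough_space(target: Path, required_gb: float = REQUIRED_GB) -> bool:
    return free_gb(target) >= required_gb


def is_installed(listing: str, name: str) -> bool:
    # Plain substring match; no external grep process is involved.
    return name in listing
```

A caller would pass the actual install directory to `has_enough_space` and the captured package listing to `is_installed`.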
The local generator now ships with tt-animatediff v0.9.0 and introduces an assistive installer to streamline setup, alongside UI enhancements that should make the tool more intuitive to work with. A housekeeping change also adds the root .env file to gitignore, preventing accidental commits of local configuration.
This release adds support for Orion SLT silicon, expanding tt-flash's compatibility across Tenstorrent's hardware lineup. If you're working with Orion SLT devices, you can now use tt-flash for firmware management and deployment workflows without workarounds.
AnimateDiff now includes a Gradio-based web interface alongside full documentation and test coverage, making it more accessible for developers exploring text-to-video generation on Tenstorrent hardware. The addition of Lightning support streamlines integration with existing training workflows, reducing friction for those looking to fine-tune or experiment with animation models at scale.
The simulator now gives RISC-V tooling better access to device configuration through expanded PCI register support and --tt-device flags, while fixing SMP synchronization issues that were blocking real workloads. On the compute side, the QSR DM core can now use vectored exceptions for more efficient interrupt handling, and floating-point edge cases—like denormal flushing in SFPSTORE—now match hardware behavior more closely.
Tenstorrent Galaxy™ is now part of the Equinix Distributed AI Hub™. See how Equinix, BetterBrain, OrionVM, and Tenstorrent are building infrastructure for agentic AI: https://tenstorrent.com/hardware/galaxy
This release focuses on performance and debugger reliability across the board. The team refactored CMake configuration for tt-metal compatibility, moved callstack introspection to C++ for efficiency, and tackled flaky test failures on Wormhole EriSC and low-clock scenarios. Key improvements include faster MemoryMap operations, better frame reconstruction with tail call and inline frame expansion support, and more robust GDB server behavior in the test harness—changes that should make debugging Tenstorrent workloads smoother and more predictable.
We've released ttsim-riscv64, a lightweight full-system simulator that can boot Linux, giving developers a faster way to validate RISC-V workloads without hardware. Beyond that, we've expanded register support across ARC, PCIe, and tile domains, and added two important Tensix features: pack format conversion mode 0x111 and SFPSTORE mode LO16, which unlock new data manipulation patterns for compute kernels.
We built an agentic AI pipeline that continuously ports, compiles, and validates models from Hugging Face on Tenstorrent hardware. With thousands of models tested so far, the pass rate remains above 90%. Learn how with TT-Forge: https://tenstorrent.com/software/tt-forge
dstack now runs workloads on Ubuntu 24.04 by default, bringing your containers up to date with the latest LTS release while allowing explicit pinning to 22.04 if needed. The release adds three practical features: instance targeting to pin runs to specific fleet members (useful for instance-volume dependencies), gateway replicas for improved availability, and native support for AWS p6-b200 and p6-b300 instances including the new NVIDIA Blackwell B300 with 6.4 Tbps networking pre-configured for maximum throughput.
Prodia generated an 81-frame 720p video in just 2.4 seconds on Tenstorrent Galaxy™ Blackhole. See how we generate video faster than real time: https://tenstorrent.com/solutions/real-time-video
The visualizer gains early beta access to tt-mlir's model explorer for end-to-end MLIR workflows, while circular buffer visualization gets a substantial upgrade with per-core pressure modals and corrected envelope calculations that prevent double-counting of globally allocated buffers. Several stability fixes address stale focus states in MLIR views and mutation bugs in tensor sharding visualizations, plus the graph_report tool can now merge captures from multi-host runs while tracking rank information. The project has also switched to UV for Python dependency management, making local setup smoother for contributors.
This release improves error handling by ensuring non-zero exit codes propagate to the shell on failures, making it easier to catch issues in scripts and CI/CD pipelines. The README install instructions now point to uv for dependency management, aligning with modern Python packaging practices.
The simulator now covers more of the hardware stack, with expanded ARC, PCIe, Ethernet, and tile register support bringing its model closer to real silicon behavior. A few Tensix fixes and clearer error messages should also smooth out debugging workflows for developers working through simulation.
TT Studio v2.7.0 brings several quality-of-life improvements centered on deployment visibility and state management. The Voice Agent now supports wake-word and voice-activity detection for hands-free operation, while deployment progress tracking extends across Docker image pulls, container startup, and media model loading to give you continuous feedback. Behind the scenes, the tool cleans up its state handling with a unified source of truth for read operations, splits cleanup into granular (--cleanup) and full-reset (--cleanup-all) modes, and fixes lingering issues with stale startup detection, persistence of terms acceptance, and the accuracy of model performance reporting.
This update broadens the simulator's coverage of the chip's management and host-interface paths, filling in more ARC, PCIe, and tile register behavior so a wider range of low-level workloads exercise the same registers they would on real silicon. A round of minor Tensix fixes and small feature additions tightens compute-engine accuracy, and clearer error reporting makes it easier to tell why a run went wrong rather than just that it did.
The v1.8.3 release tightens up Ethernet and memory management in the simulator. Fixes to eth base firmware interactions in both idle and active erisc code eliminate quirks that could have masked real hardware behavior, while proper ARC_MSG_QCB_PTR register support brings the simulation closer to actual device firmware. Removing the MMIO DMA hack in favor of outbound iATU configuration is a meaningful architectural improvement that lets you test more realistic DMA scenarios across multiple chips. The addition of pack format conversion mode 0x15 for Tensix expands format coverage for those working on tensor operations.
We've uplifted Llama-3.3-70B-Instruct support on Blackhole QuietBox 2, with the implementation now running on P300X2 devices via tt_transformers. If you're working with Wormhole Galaxy, make sure you're on tt-smi 4.0.0, firmware 19.2.0, and tt-kmd 2.5.0 to stay compatible with this release.
dstack now lets you pin runs to specific fleet instances instead of always provisioning new resources when your preferred instance isn't available. You can target instances by name, hostname, IP address, or fleet reference—including instances from shared projects—giving you finer control over where workloads land and helping optimize for cost or data locality. This is particularly useful when you want to ensure a run uses a specific pre-configured machine or avoid triggering unnecessary provisioning.
tt-smi v5.3.0 brings better hardware monitoring and PCIe management capabilities. Fan speed reporting now shows RPM values instead of percentages across n300d and n150d devices, with datasheet-derived limits for more accurate thermal tracking. The release also adds RESET_PCIE_LINK support for Galaxy systems running compatible KMD versions and consolidates driver version checking and reset functionality directly into tt-smi, streamlining the toolchain.
ttsim now simulates the larger Wormhole and Blackhole multi-chip configurations, with the new wh_x32 and bh_x32 setups covering 32-chip Galaxy deployments. The release also expands host communication capabilities by adding outbound iATU, DMA, and host-to-device multicast support, while patching several fabric and multichip issues that affected P300 and T3000/LoudBox systems. This rounds out the simulator's coverage of Tenstorrent's production hardware and lets developers validate larger cluster topologies earlier in their workflow.
This release brings Zephyr 4.4.0 to Blackhole and Grendel, along with several quality-of-life improvements for Blackhole operations. Error reporting is more granular: the STATUS_ERROR_STATUS0 register now accumulates init failures as a bit field, so a single read shows exactly which stage failed (regulator, cable, Tensix, MRISC, or GDDR training). Runtime firmware parameters like TDP limits can now be overridden and persisted across upgrades, though you'll want to be on tt-flash v3.8.0+ to ensure those overrides stick. On the connectivity side, P300C boards get PCIe Gen4 for better QB2 stability, and Ethernet firmware improvements address interrupt handling and add ECC support to MACPCS.
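Reading an accumulated bit field generally looks like the sketch below. The stage names come from the release note, but the bit positions are illustrative assumptions, not the firmware's actual layout; consult the firmware documentation for the real assignments.

```python
from enum import IntFlag


class InitError(IntFlag):
    # Bit positions are assumed for illustration only.
    REGULATOR = 1 << 0
    CABLE = 1 << 1
    TENSIX = 1 << 2
    MRISC = 1 << 3
    GDDR_TRAINING = 1 << 4


def failed_stages(status: int) -> list[str]:
    # Test each known bit so that unknown high bits are ignored
    # rather than misreported.
    return [flag.name for flag in InitError if status & flag.value]
```

With this assumed layout, a register value of `0b10100` would decode to the Tensix and GDDR training stages, and `0` would mean no recorded failures.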
This release fixes a bug where the program counter couldn't be read reliably when a core was halted, and extends UmdApi to properly handle core reset logic on the QUASAR architecture. The test suite also gains conditional skipping for scenarios that don't apply to certain configurations, reducing noise in CI runs and making it easier to focus on relevant failures.
This release brings several refinements to dev environments and service deployments. Zed is now supported as a remote IDE option, with automatic server setup and convenient zed:// links for quick connection—useful if you're already in that editor ecosystem. On the services side, you can now specify spot_policy and reservation at the replica group level, enabling mixed strategies like running baseline replicas on reserved capacity while autoscaling overflow on spots; additionally, Shepherd Model Gateway workers now communicate via gRPC with both vLLM and SGLang, which reduces duplicate tokenization work and makes request routing more efficient. Azure users gain finer subnet control, JarvisLabs adds RTX PRO 6000 GPU support, and several reliability fixes address SSH connection pooling and service scaling edge cases.
Learn how you can simulate Tenstorrent hardware! Follow along with all 31 examples in https://docs.tenstorrent.com/tt-vscode-toolkit/lessons/ttsim-twenty-and-ten/
This release brings a Gradio UI frontend to the AnimateDiff implementation, making it easier to interact with the model through a web interface rather than command-line tools. The team also fixed layout responsiveness issues and laid groundwork for Lightning mode support, which will enable faster inference on Tenstorrent hardware in a future update. These changes lower the barrier to experimenting with AnimateDiff while keeping the door open for significant performance improvements.
This release improves the visualizer's reliability and usability across several fronts: better error messaging with scp fallback for sync issues, smarter graph rendering through edge deduplication and linked node navigation, and more flexible performance analysis with toggleable columns and graceful data reloading. You'll also find stronger support for diverse profiler formats and buffer allocation breakdowns, including pre-aggregated chunk handling and per-core CB calculations, plus cleaner stack trace navigation and more robust file handling across rank suffix variations.
The simulator now models larger Blackhole and Wormhole configurations, letting you test code targeting the 2-chip P300 and 8-chip T3000/LoudBox systems before running on hardware. A new multicast capability in host-to-device TLBs enables more realistic memory access patterns, and several Tensix core fixes improve overall simulation fidelity.
This release adds performance counters support for deeper profiling visibility, along with hardware-specific fixes for Wormhole and Blackhole devices that improve stability across different architectures. Thread-local storage now works reliably across multiple cores, and improved gcov integration makes coverage analysis more straightforward for developers validating their code paths. These changes round out the debugging and analysis toolkit for those working directly with Tenstorrent hardware.
This release brings visual refinements and usability improvements to the hardware visualization tool. You can now toggle between light and dark modes with improved contrast handling, while heatmap rendering gets a fix and responsive sizing so the visualization adapts better across different screen sizes. The update also includes security hardening to make the tool safer for production use.
v0.7.4 ships a plugin system that lets developers extend tt-local-generator with their own generation flows, alongside a new in-app log viewer and menu improvements that make the UI easier to navigate. Remix functionality is also enhanced in this release, giving users more control over iterating on generated outputs. Assets are distributed as .deb packages covering Flux, Mochi, Qwen3, SkyReels, Wan2, and AnimateDiff models.
Simulation support grows to cover multi-chip setups. The ttsim upgrade to v1.8.0 pulls down a new libttsim_wh_x2.so binary for a two-chip N300 Wormhole mesh alongside the existing single-chip Wormhole and Blackhole libraries. A new lesson walks through opening a MeshDevice(1, 2) and running a sharded element-wise add across two virtual chips, with a ready-to-run template and the matching cluster descriptor now copied into place during setup. It's a practical on-ramp for exercising mesh code without physical multi-chip hardware.
ttsim v1.8.0 extends the simulator's multi-chip capabilities with support for N300 (2-chip Wormhole) under the new wh_x2 configuration, and adds the RISC-V floating-point (F) extension to the Blackhole/QSR babyrisc model. Additional tile register functionality and another round of Tensix and QSR bug fixes continue to improve simulation fidelity.
tt-metal v0.72.0 adds Wan2.2-distill (LightX2V), Index-AniSora V3.2 video generation, and a LoRA adapter pipeline for Wan2.2 Image-to-Video, alongside SFPI 7.50.0. A new Metal 2.0 factory adapter and a fused DeepSeek MoE reduce-NC operation lay groundwork for upcoming architecture work, while targeted fixes address conv3d garbage output on batch-size changes, Gemma3-4B accuracy, DEVICE_PRINT linkage on Quasar, and a LoRA fusion state-dict corruption bug.
tt-kmd 2.9.0 brings the power-management framework from Blackhole 2.6.0 to Wormhole — placing devices in low-power state at probe and re-aggregating power flags on every open/close. The release also adds deferred idle power-down via a delayed work item, read-only DMA pinning for file-backed memory, and protection for telemetry reads against concurrent resets. Wormhole users must be on CMFW 19.10.0 or later for the new power policy to take effect.
tt-toplike v0.6.2 fixes power reporting by overlaying SMBUS TDP and TDC values onto the sysfs telemetry path, giving a more accurate picture of actual power consumption on Tenstorrent hardware. The release also adds a configuration setting that is agnostic to the AI vendor, giving developer tooling more flexibility.
ttsim v1.7.3 expands ARC and tile register coverage in the simulator and delivers another round of QSR and Tensix bug fixes with improved error reporting — incremental improvements that keep the simulation tracking closely with hardware behavior.
A web-build fix repairs broken media on the published site. Image, video, and link targets derived from GitHub-blob URLs were being emitted as bare /assets/img/... paths, which resolved to the domain root instead of the project's GitHub Pages sub-path and left assets like the particle-life simulation GIF broken. Those paths now run through the siteUrl() helper so they carry the correct base prefix, and a new link-validator test guards against missing local targets slipping through again.
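The failure mode is easy to reproduce with a small prefixing helper. The Python function below is a sketch of the general technique, not the project's `siteUrl()` implementation, and the `/my-project/` base path used in the usage note is a made-up example.

```python
def site_url(path: str, base: str) -> str:
    # Absolute URLs already carry their own origin; leave them alone.
    if "://" in path:
        return path
    trimmed = base.strip("/")
    prefix = "/" + trimmed if trimmed else ""
    # Avoid double-prefixing a path that is already under the base.
    if prefix and (path == prefix or path.startswith(prefix + "/")):
        return path
    return prefix + "/" + path.lstrip("/")
```

For a site served from `/my-project/`, `site_url("/assets/img/demo.gif", "/my-project/")` yields `/my-project/assets/img/demo.gif`, whereas the bare path would resolve against the domain root.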
tt-exalens v0.3.21 moves ELF and DWARF parsing from Python to C++, improving debugger performance and callstack evaluation accuracy. Reading and writing via the debug module's system bus is now enabled, remote communication tests gain data-consistency verification after halting RISC debug, and the GDB binary is removed from wheel packaging to reduce distribution size.
dstack 0.20.23 ships a tidy set of bug fixes and reliability improvements — proxy environment variables now correctly pass through to running containers, image pull progress reporting is more accurate, and latency in the run provisioning pipeline is reduced so workloads spin up faster. A fix for Verda spot offers being incorrectly marked unavailable rounds out the release.
tt-animatediff v0.1.0 is the first public release of AnimateDiff running on Tenstorrent hardware, opening up image-to-video generation to the Tenstorrent developer community. The release includes initial repository scaffolding, bug fixes, and documentation to get started.
This study addresses on-device inference bottlenecks of Transformer models on Tenstorrent's Tensix architecture and proposes an operator fusion strategy that enhances data locality. RMSNorm is fused with matrix multiplication in self-attention and in the FFN, enabling back-to-back execution of memory-bound and compute-bound operators in on-chip SRAM to significantly reduce DRAM reads/writes of intermediate results and scheduling overhead. To support multi-core parallelism, a NoC-based multicast mechanism is leveraged in which row/column master nodes efficiently distribute inputs and weights across the core mesh, alleviating DRAM bandwidth contention. Experiments on the Wormhole platform with Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B show up to 37.44% latency reduction for attention and…
The TTNN Visualizer adds L1 fragmentation pressure to the performance table, a performance overlay on the graph view showing per-operation device time, and inspectable MLIR nodes with attribute/input/output side panels. Navigation through performance charts also gets a new in-page dropdown for easier exploration.
tt-umd v0.9.6 broadens platform support with new manylinux ARM64 (aarch64) Python wheels alongside existing x86_64 builds, making it easier to develop on a wider range of machines. Buffer handling in noc_read and dma_read gains enhanced guards, the Chip and TTDevice constructors are refactored for cleaner separation, and ethernet broadcast no longer depends on the full Cluster stack. Available as DEB/RPM packages and Python wheels for Python 3.9–3.13.
tt-toplike v0.6.1 polishes the terminal monitoring interface with layout jitter fixes that keep the display steady during live updates, enhanced compact mode for smaller terminal windows, and improved firmware version display — small but meaningful ergonomic wins for anyone watching their Tenstorrent cards in a busy terminal.
ttsim v1.7.1 adds the Zbc (carryless multiplication) RISC-V extension to the QSR RV64 model, bringing the simulator closer to full ISA coverage. The release also delivers expanded ARC and PCIe tile/TLB functionality, a round of Tensix bug fixes, and improved error message reporting throughout the simulation stack.
Blackhole gains a new TT_SMC_MSG_TOGGLE_ETH_RESET API for resetting ethernet tiles, while Wormhole gets power improvements including linking AICLK_BUSY to the max AI clock bit. Board-specific documentation is now auto-generated from protobuf definitions, and DRAM low-power mode is disabled on instances prone to slow retraining.
tt-flash v3.8.0 adds an important safety improvement for Blackhole systems: the CCFG override (ccfgovr) table is now preserved across firmware flashes, preventing accidental overwrite of critical configuration settings. A buffer boundary fix in boot_fs prevents read_tag from walking past the end of the buffer, closing a potential crash path during flash operations.
tt-exalens v0.3.20 brings Rocket Core support for debugging the new RISC-V implementation in Blackhole, adds an aarch64 build for SFPI (bumped to version 7.49.0), and makes ELF parsing thread-safe. Remote chip performance on T3K configurations sees concrete improvements, and a JAL offset calculation fix in the RISC-V debug path closes a correctness bug.
The inference server now ships its first Helm chart, making it straightforward to deploy any supported model spec on Kubernetes. This release also brings updated support for SpeechT5 TTS, Whisper large-v3, and Distil-large-v3 across both P150 and the new Blackhole QuietBox 2 hardware.
The first tagged release of BarraCUDA introduces CPU and RISC-V backends for Triton kernels — you can now compile a tl.dot matmul and run it natively on x86-64 or under QEMU with no GPU required. Cross-backend differential testing is included from the start, using the CPU backend as an oracle to verify correctness across compilation targets.
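Differential testing against an oracle can be sketched without any compiler involved. In the pure-Python illustration below, `matmul_reference` plays the role of the trusted backend and `matmul_candidate` stands in for a second backend; both functions and the tolerance are assumptions for the example, not BarraCUDA code.

```python
import math
import random


def matmul_reference(a, b):
    # Oracle: straightforward row-by-column product.
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_candidate(a, b):
    # Stand-in for another backend: same math, different loop order.
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for p in range(k):
        for i in range(n):
            for j in range(m):
                out[i][j] += a[i][p] * b[p][j]
    return out


def agree(x, y, tol=1e-9):
    return all(math.isclose(u, v, rel_tol=tol, abs_tol=tol)
               for row_x, row_y in zip(x, y) for u, v in zip(row_x, row_y))


def differential_test(trials=20, seed=0):
    # Feed both implementations the same random inputs and compare outputs.
    rng = random.Random(seed)
    for _ in range(trials):
        n, k, m = (rng.randint(1, 4) for _ in range(3))
        a = [[rng.uniform(-1, 1) for _ in range(k)] for _ in range(n)]
        b = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(k)]
        if not agree(matmul_reference(a, b), matmul_candidate(a, b)):
            return False
    return True
```

Comparing with a tolerance rather than exact equality matters once backends reorder floating-point operations.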
The AnimateDiff example gets a round of correctness and portability fixes. Cleanup now calls close_mesh_device to match the MeshDevice it actually creates, and the ttnn import was made lazy so the module loads on CPU-only machines like CI runners. Several docstrings and comments were corrected to reflect that frames are denoised sequentially and that the Phase 3 path still bounces tensors to CPU, and stale setup.py metadata was cleaned up. New hardware-free PyTorch tests cover the cross-frame attention helper, and CI now runs the full test suite rather than a single file.
dstack 0.20.22 expands Tenstorrent hardware support from Wormhole to Blackhole — PCIe cards, LoudBox, QuietBox, and Galaxy systems are now first-class dstack targets. The Vast.ai backend gains fine-grained offer filtering (min reliability, min score, offer ordering), a new Miles reinforcement learning example shows 32B GRPO training across a multi-node cluster, and AWS P3/V100 support is sunset.
tt-forge 1.2.0 ships the latest model uplifts from the forge models repository alongside a suite of new Tenstorrent ecosystem skills for automated Claude-based workflow testing. The release continues rapid iteration on CI automation, with AI-driven installation verification and an expanding test matrix against the latest model weights.
ttas v0.1.0 reaches a significant milestone: all 128 Wormhole b0 instructions are now fully cross-checked against Tenstorrent's canonical TT_OP_ macros in ckernel_ops.h, verifying opcodes, start bits, and field widths across the complete ISA. There's one breaking change — positional operand order in .tts files now follows the TT_OP_ macro signature rather than ascending start_bit order; named operands are unaffected.
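To illustrate why the positional-order change is breaking, here is a minimal Python sketch. The instruction layout, field names, and opcode are made up for illustration and are not taken from ckernel_ops.h or from ttas itself; the only assumption carried over from the release note is that signature order and ascending start_bit order can differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    start_bit: int
    width: int


# Hypothetical instruction: operands listed in (made-up) macro signature order,
# which here happens to be descending start_bit order.
SIGNATURE = [Field("mode", 22, 2), Field("index", 14, 8), Field("dest", 0, 14)]
OPCODE = 0x5A  # made-up opcode, assumed to occupy bits 31..24


def encode(operands, fields):
    """Pack positional operands into a 32-bit word, one operand per field."""
    if len(operands) != len(fields):
        raise ValueError("operand count does not match field count")
    word = OPCODE << 24
    for value, field in zip(operands, fields):
        if not 0 <= value < (1 << field.width):
            raise ValueError(f"{field.name} out of range")
        word |= value << field.start_bit
    return word


# The same positional operands, interpreted under the two orderings.
by_signature = encode([1, 2, 3], SIGNATURE)
by_start_bit = encode([1, 2, 3], sorted(SIGNATURE, key=lambda f: f.start_bit))
```

Under these assumptions the two encodings differ, which is why a .tts file written with positional operands for the old ordering would need review, while named operands map to the same fields either way.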
ttnn-visualizer v0.87.0 improves the report loading experience with richer progress feedback during upload and remote sync, and now surfaces stack trace file contents stored in the profiler database directly in the UI. L1 Small tensors get their own filter and display path in the tensor list, and empty NPE traces are now rejected with a clear error rather than silently loading as broken reports.
ttas v0.0.1 is the first public release of a community-built assembler and disassembler for Tensix, the compute engine in Tenstorrent accelerators. It covers all 128 Wormhole b0 instructions generated from tt-llk's assembly.yaml, assembles .tts text into 32-bit Tensix words (binary, hex, or annotated C arrays), and shares a single instruction table between assemble and disassemble so the two paths can never drift apart.
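The shared-table idea can be sketched in a few lines of Python. This is not the ttas implementation: the two instructions, their opcodes, and the assumption that the opcode sits in the top 8 bits are all hypothetical. The point is only that both directions read the same table, so a field change cannot reach one path without the other.

```python
# Hypothetical two-entry table; the real ttas table is generated from
# tt-llk's assembly.yaml and is not reproduced here.
# Each field is (name, start_bit, width).
TABLE = {
    "NOP_DEMO": {"opcode": 0x02, "fields": [("count", 0, 8)]},
    "MOV_DEMO": {"opcode": 0x10, "fields": [("src", 12, 12), ("dst", 0, 12)]},
}
BY_OPCODE = {entry["opcode"]: name for name, entry in TABLE.items()}


def assemble(name, **operands):
    """Encode one instruction into a 32-bit word using the shared table."""
    entry = TABLE[name]
    word = entry["opcode"] << 24  # assumption: opcode in bits 31..24
    for field, start, width in entry["fields"]:
        value = operands[field]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{field} does not fit in {width} bits")
        word |= value << start
    return word


def disassemble(word):
    """Decode a 32-bit word using the same table that assemble() reads."""
    name = BY_OPCODE[word >> 24]
    operands = {
        field: (word >> start) & ((1 << width) - 1)
        for field, start, width in TABLE[name]["fields"]
    }
    return name, operands
```

A round trip such as `disassemble(assemble("MOV_DEMO", src=5, dst=9))` returns the original mnemonic and operands, because neither function carries its own copy of the layout.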
tt-toplike v0.6.0 upgrades the Insights screen with process management capabilities and power analysis, giving operators a clearer view of what's running on their Tenstorrent hardware and how much power it's drawing — all from the familiar terminal monitoring interface.
A landmark release: ttsim v1.7.0 marks the initial public, open-source release of the Wormhole and Blackhole chip simulators, letting developers write and validate Tensix kernels, debug firmware, and explore hardware behavior without physical silicon. The release includes numerous QSR and Tensix bug fixes, improved error reporting, and optimized build flags — a powerful new resource for the broader Tenstorrent developer community.
tt-local-generator v0.3.3 adds AnimateDiff video generation to the local generation toolkit alongside broader artgen UX improvements. Running fully on Tenstorrent hardware, this expands the creative possibilities for developers experimenting with on-device generative AI beyond still images.
ttsim v1.6.4 adds ARC Control Store Memory (CSM) support on Wormhole to the simulator, making more of the hardware's debug and diagnostic path accessible in simulation. RISCV_DEBUG_REG_DBG_FEATURE_DISABLE is now readable (not just writable), and a round of Tensix bug fixes and improved error reporting further refines simulation accuracy.
dstack 0.20.21 adds JarvisLabs as a new cloud backend (including spot GPU instances) and brings multi-cluster Kubernetes support, where a single backend config can manage multiple clusters via kubeconfig contexts — each becoming its own dstack region with independent proxy and namespace settings. A handy update for teams running Tenstorrent workloads across heterogeneous cloud infrastructure.
TT Studio gains a Voice Agent wizard for multi-chip solution deployments, walking users through device allocation, model compatibility checks, and slot conflict resolution across multi-device setups. The release also ships an in-app bug reporting tool that automatically gathers service and runtime logs — making it easier to submit useful diagnostics without leaving the UI.
ttnn-visualizer v0.86.0 adds interactive I/O graph node highlighting when a focused operation is selected, refreshes the memory legend iconography for improved clarity, and optimizes the JavaScript bundle for the hosted deployment. Out-of-bounds memory items and tensors are now dimmed when an L1 memory range is set, and the project bumps to Node v24 LTS to restore Dependabot support.
A small but real papercut in the AnimateDiff Phase 2 walkthrough is fixed. The 16-frame run template was a single-quoted string, so a prompt like “World's Fair” needed an awkward backslash escape. Switching to a template literal lets the apostrophe reach the shell cleanly, so the example works exactly as written.
tt-inference-server v0.14.0 expands Blackhole QuietBox 2 (p300x2) model coverage significantly, adding FLUX.1-dev image generation, Wan2.2-T2V-A14B video generation, Llama-3.1-8B, Qwen3-32B, and Qwen3-VL-32B-Instruct. The CLI's run.py now clearly indicates which release version to use for models not yet uplifted since v0.10.0, making version navigation easier for operators managing multi-model deployments.
tt-flash v3.7.0 ships a self-contained PyInstaller binary as a new release artifact, making it possible to run the firmware flash tool on systems without a Python environment. As part of this transition, Debian package builds are sunset — the binary is the new recommended installation path for systems where pip isn't available.
dstack 0.20.20 adds support for NVIDIA Dynamo prefill-decode (PD) disaggregated inference, allowing replica groups to declare a Dynamo router and use SGLang, vLLM, or TensorRT-LLM as the inference backend — a powerful configuration for teams running large-scale inference on mixed hardware including Tenstorrent systems.
tt-smi v5.2.0 updates the UMD dependency from 0.9.4 to 0.9.5, picking up the latest hardware interface improvements and bug fixes from the underlying Tenstorrent device management library.
Kernels written by and for tt-lang are loaded onto Blackhole chips on the Tenstorrent QuietBox 2, powering all forms of "AI" in the open source Civilization clone, FreeCiv. Look at all those fish!
A 6,500-word community deep dive into the Blackhole p100a architecture: the tile model (Tensix, DRAM, SiFive x280 L2CPU, Ethernet, PCIe, NoC arc), firmware startup sequence, MOP micro-op processor, replay buffer, FPU/SFPU sync, and the anatomy of a kernel. From the author of blackhole-py.
Lecture 20 from William & Mary's graduate Computer Architecture course. Frames Tenstorrent in the landscape between GPUs and TPUs, draws comparisons to Cerebras and SambaNova, then dives deep into the Wormhole chip and Tensix core: the 5 RISC-V core design, SFPU, NoC, and dataflow execution model.
A fused kernel for the Grayskull architecture implementing Transformer self-attention entirely within SRAM. Combines matrix multiply, attention score scaling, and Softmax without DRAM accesses, achieving significant speedups over non-fused implementations.
Evaluates the Tenstorrent Grayskull e75 RISC-V accelerator for matrix multiplication at reduced numerical precision (BFP8 and LoFi), a fundamental kernel in LLM inference computation.
Evaluates three strategies for scaling an N-body code across multiple Tenstorrent Wormhole accelerators. Builds on the established performance of single-card N-body work to explore parallelism via the on-chip NoC and multi-accelerator configurations.
Compiler system that automatically generates efficient dataflow plans for tile-based languages on spatial accelerators including Tenstorrent Wormhole. Exploits on-chip network forwarding between processing elements to reduce DRAM pressure.
Shows that Text-to-Speech inference on Tenstorrent Lightning V2 achieves 4× lower cost than NVIDIA L40S. Applies BlockFloat8 (BFP8) and low-fidelity (LoFi) precision strategies to TTS despite their greater numerical fragility compared to LLMs.
Tenstorrent Low-Level Kernels: the C++ library that directly programs the RISC-V cores inside each Tensix compute engine. TRISC0 (unpack), TRISC1 (math/FPU/SFPU), and TRISC2 (pack) are all programmed through this layer — it is the interface between TT-Metal kernel code and bare silicon.
ttnn-visualizer v0.85.0 introduces an experimental MLIR graph view in dev mode for inspecting compiler intermediate representations — a powerful new tool for compiler developers. Operation graph navigation now preserves focus context when moving to input/output operations, a scroll-to-tensor feature brings the Buffers view directly to the selected tensor, and database queries switch to explicit column names for better schema resilience.
tt-exalens v0.3.19 bumps UMD to 0.9.5 and adds a SIGBUS recovery mechanism in UmdDevice and UmdApi — an important reliability improvement for hardware debug sessions where PCIe access errors can otherwise crash the tool rather than being handled gracefully.
Maps 2D 5-point stencil computations onto the Tenstorrent Wormhole RISC-V AI dataflow accelerator via two implementations: element-wise decomposition (Axpy) and matrix-multiplication reformulation (MatMul). Profiling shows the isolated Wormhole kernel is competitive with CPU execution, with PCIe transfers and initialization driving end-to-end overhead; Axpy achieves lower energy than the CPU baseline at large scales. Identifies architectural and software directions for making AI accelerators viable for HPC stencil workloads. 2025.
tt-smi v5.1.1 is a quick follow-up that fixes a missing CSS file in the PyInstaller standalone binary, restoring the proper visual appearance of the system management interface when running as a self-contained executable.
New release: whisper 1.861
tt-exalens v0.3.18 adds performance counter debug registers for deeper hardware profiling, fixes a regex pattern issue in ICCM ID extraction, and resolves a display bug in the process info view — solid improvements for firmware and hardware debugging workflows.
Three interface fixes land here. The first-install theme logic now respects a theme set at workspace or workspace-folder scope, not just the global one, so it won't overwrite a deliberate per-project choice. The Tensix visualizer's play button again responds to labelled controls after a duplicate comparison was replaced with a prefix check, and the cluster visualizer's dot-mode animation no longer desyncs its column count from initialization. The latter two fixes were also backported upstream.
tt-smi v5.1.0 adds GDDR firmware version to the firmware table for more complete hardware visibility, ships PyInstaller binaries as official release artifacts, fixes a hang when using the eth_train_skip option with the UMD backend, and refactors the reset parsing logic into its own module for better maintainability.
tt-sim v1.0 is the first tagged release of a community-built Tenstorrent hardware simulator, providing a simulation environment compatible with the mesham/tt-metal fork. This gives developers an accessible way to explore tt-metal on Tenstorrent architecture without requiring physical hardware.
Boltz-2 biomolecular model for drug discovery on Tenstorrent Blackhole. Supports single-card and multi-card configurations — QuietBox (4×) and Galaxy (32×). Approaches physics-based FEP accuracy at 1000× the speed.
Deep-dive into the Tenstorrent architecture and Metalium programming model — circular buffers, kernel synchronization, NoC routing, and where the footguns are. The honest guide to thinking in Tensix.
Sponsored series of deep technical articles on implementing optimal SFPU kernels for the Tenstorrent Wormhole and Blackhole vector units. Covers where, typecasting, 16/32-bit integer multiplication, cube root, and accurate sin/cos/tan — with cycle counts, assembly walkthroughs, and Blackhole vs Wormhole comparisons throughout.
Step-by-step guide to getting a Tenstorrent card running on Arch Linux with the full Metalium stack. Practical troubleshooting from someone who did it the hard way first.
Honest field notes from getting a Grayskull card running and writing first Metalium kernels. Covers setup pitfalls, processor hangs, memory protection quirks, and what makes Metalium compelling despite early rough edges.
Ports the Cooley-Tukey FFT algorithm to the Wormhole n300 RISC-V accelerator. The Wormhole draws 8× less power and consumes 2.8× less energy than a 24-core Xeon Platinum for a 2D FFT. ISC 2025.
Accelerates an astrophysical N-body simulation on the Wormhole n300. Achieves 2× speedup and 2× energy savings over a highly optimized CPU implementation. SC '25 Workshop.
Implements three numerical kernels and composes them into a conjugate gradient solver on Wormhole. Demonstrates that AI accelerators merit consideration for HPC workloads traditionally dominated by CPUs and GPUs. 2026.
Explores stencil computation on the Grayskull PCIe RISC-V accelerator. Early academic work examining TT hardware for HPC stencil workloads. 2024.
Makes multi-tenant NPU sharing practical for Blackhole-class hardware using polynomial-time allocation algorithms. Delivers up to 1.37× higher utilization and 1.14× faster workload completion. Up to 890,000× faster than NP-hard baselines.
Three agentic projects running fully on-device: local AI agents on QuietBox 2, a coding assistant powered by Aider against a local inference server, and the OpenClaw AI assistant on QuietBox 2. No cloud APIs — all inference runs on TT hardware.
Three lesson-projects covering on-device video synthesis: frame-by-frame diffusion with tt-local-generator, native AnimateDiff video animation, and video generation on QuietBox 2. All run entirely on TT hardware with no cloud dependency.
Particle Life simulation on Tenstorrent hardware — an emergent-behavior N-body system where simple attraction/repulsion rules between species produce complex lifelike patterns. Cookbook recipe demonstrating parallel N-body compute on Tensix.
Seven-module computer science curriculum taught on real Tenstorrent hardware. Covers RISC-V architecture, memory hierarchy, parallel computing, networks and NoC, synchronization, abstraction layers, and computational complexity — all grounded in what is physically happening on the chip.
A Tenstorrent-powered claw machine that rewards players with real prizes. The QuietBox 2 runs local AI inference to act as an agent controlling the claw hardware — the OpenClaw AI assistant lesson builds directly on this project.
On-device image generation with Stable Diffusion XL running entirely on Tenstorrent hardware. Full inference pipeline with no cloud dependency.
End-to-end image classification project using TT-Forge — compile and run a PyTorch classification model on Tenstorrent hardware with no kernel authoring required.
Interactive browser-based visualizer of the Tenstorrent Tensix grid architecture. Explore the NoC, core layout, and dataflow patterns without hardware — a great companion for learning kernel programming.
TT-Metalium implementation of Conway's Game of Life as a cookbook recipe. Each generation is a full parallel kernel dispatch over the grid — a clean introduction to stateful compute on Tensix cores.
Eight-lesson series covering the full custom training workflow on TT hardware: dataset fundamentals, configuration patterns, fine-tuning, multi-device distributed training, experiment tracking, model architecture basics, and training from scratch.
Three hands-on TT-Metalium kernel recipes: a Mandelbrot fractal explorer, real-time audio signal processing pipeline, and custom image filter stack. Each recipe is a complete kernel project with full source in the lesson.
Fast full-system simulator of Tenstorrent Wormhole and Blackhole hardware. Runs TT-Metalium workloads on any Linux/x86_64 system without physical silicon. Bit-exact results relative to hardware.
PJRT device plugin for Tenstorrent hardware. Enables JAX, PyTorch/XLA, and other XLA-based frameworks to target TT accelerators.
Production-ready model serving for Tenstorrent hardware with OpenAI-compatible REST API. Supports continuous batching, multiple models, and all TT hardware configurations.
Python-based DSL that sits between TT-NN and TT-Metalium — expresses custom fused kernels with progressive disclosure, compiling directly to Tensix. Ships an integrated functional simulator (no hardware needed), line-by-line performance metrics, and AI-agent-friendly tooling. Two packages: tt-lang (compiler + hardware, requires ttnn) and tt-lang-sim (simulator only, works on Linux/macOS without Tenstorrent hardware).
Install the complete Tenstorrent software stack with one command. Handles drivers, firmware, Python environment, and SDK setup automatically.
48 interactive lessons covering the full Tenstorrent developer path — from hardware detection to custom training — with click-to-run commands and hardware auto-detection. Available in VSCode and code-server.
tt-local-generator v0.2.6 adds SkyReels Image-to-Video (I2V) generation, expanding the tool's creative capabilities beyond text-to-image. The release also delivers artgen UX improvements and a website refresh, rounding out a growing local AI media generation toolkit that runs entirely on Tenstorrent hardware.
ttnn-visualizer v0.84.1 fixes a crash that could occur when checking configuration paths after a report had been deleted, and pins mypy to a known-good version to keep the type-checking pipeline stable.
TT-Deploy, May 1st, 2026 – Customers running Tenstorrent in production — David Bennett (AI&), Alex Nataros (Cirrascale), Sanchayan Sinha (Turiyam), and Mike Gorbinski (Virtu Financial) — share why they chose our hardware and what they're building on it.
TT-Deploy, May 1st, 2026 – Jim Keller reads our love letter to DeepSeek v4, reviews how Tenstorrent achieves unlimited scale, and shows off our video gen benchmark on Artificial Analysis. He closes with "Tenstorrent is where AI runs."
TT-Deploy, May 1st, 2026 – Stan Sokorac, Sr. Fellow, Software, reveals the results of an agentic pipeline that continuously tests random Hugging Face models on Tenstorrent hardware: a 90% pass rate, projecting to roughly 2.5 million models. Plus a deep dive on TT-Lang, TT-Forge, and the only 100% open-source high-performance AI software stack.
TT-Deploy, May 1st, 2026 – Tenstorrent's Amr Elashmawi sits down with Justen Aguillon (Equinix), Sheng Yeo (OrionVM), and Abhishek Bhargava (BetterBrain) to walk through the Equinix Distributed AI Hub — a full-stack sovereign agentic AI platform now live in Ashburn.
TT-Deploy, May 1st, 2026 – Jasmina Vasilović, Senior Fellow of ML Frameworks & Programming Models, walks through the Tenstorrent software stack and how Blackhole's switch-free architecture extends the low-cost serving curve where GPU economics collapse — plus a partner spotlight on Prodia's faster-than-real-time WAN 2.2 video generation.
TT-Deploy, May 1st, 2026 – Jim Keller walks through the fundamentals (scale, general-purpose design, and lower cost) that make Tenstorrent a fit for the constantly changing landscape of AI.
Join us as we unveil Tenstorrent’s AI solutions deployed at scale. See the full breadth of what we've built, validated by real architecture, benchmarks, and customer deployments.
A release-pipeline fix corrects how the Open VSX publish step is gated. GitHub Actions doesn't expose the secrets context in step-level if: conditions, so the original check never behaved as intended. The step now tests env.OVSX_PAT — mapped from the secret in the same step — matching the pattern already used for the Marketplace publish job, so conditional publishing works reliably.
This release leans into hardening the lesson webview. External-link handling now opens only http and https URIs, ignoring file:, vscode:, command: and other schemes so webview messages can't trigger unintended actions, and YouTube embeds are converted to thumbnails before HTML sanitization runs, with the layout styles preserved through the sanitizer. Rounding it out are a click handler that guards against non-element targets, a broadened SVG regex so the agent sun-bleed effect fires reliably, and a stray bracket removed from a user-facing context-limit message.
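The scheme allow-list described above can be sketched as follows. This is a Python illustration of the idea only, not the extension's own code; the function name `is_openable` is hypothetical.

```python
from urllib.parse import urlsplit

# Only these schemes are allowed to leave the webview.
ALLOWED_SCHEMES = {"http", "https"}


def is_openable(uri: str) -> bool:
    """Return True only for http/https URIs; everything else is ignored."""
    try:
        scheme = urlsplit(uri.strip()).scheme.lower()
    except ValueError:
        # Malformed input is treated the same as a disallowed scheme.
        return False
    return scheme in ALLOWED_SCHEMES
```

An allow-list is the safer design here because any scheme the check does not recognize, including ones added later, is rejected by default rather than needing to be enumerated in a block-list.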
dstack 0.20.19 adds a configurable time window for autoscaling RPS calculation (30s, 1m, or 5m), along with Kubernetes support for private registry authentication via imagePullSecrets and read-only volume mounts — giving teams running Tenstorrent inference servers more control over scaling behavior and deployment security.
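To show what the window length changes in practice, here is a small Python sketch of a sliding-window RPS calculation. It is not dstack's implementation; the class name and the mapping of window labels to seconds are assumptions made for illustration.

```python
from collections import deque

# Window labels from the release note, mapped to seconds (assumed mapping).
WINDOWS = {"30s": 30, "1m": 60, "5m": 300}


class RpsWindow:
    """Average requests per second over a fixed trailing window."""

    def __init__(self, window: str):
        self.seconds = WINDOWS[window]
        self.timestamps = deque()

    def record(self, t: float) -> None:
        self.timestamps.append(t)

    def rps(self, now: float) -> float:
        # Drop requests that have fallen out of the trailing window.
        while self.timestamps and self.timestamps[0] <= now - self.seconds:
            self.timestamps.popleft()
        return len(self.timestamps) / self.seconds


# A burst of 60 requests at t=100, viewed through each window one second later.
short, medium, long_ = RpsWindow("30s"), RpsWindow("1m"), RpsWindow("5m")
for w in (short, medium, long_):
    for _ in range(60):
        w.record(100.0)
burst_view = (short.rps(101.0), medium.rps(101.0), long_.rps(101.0))
```

In this sketch the short window reacts strongly to the burst and forgets it quickly, while the long window smooths it out, which is the trade-off the configurable setting exposes.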
All these videos were generated at home on Tenstorrent TT-QuietBox 2 using WAN 2.2 and SkyReels.
Qwen3-32B powers three interconnected agent demos. Learn how to run them on your own: https://docs.tenstorrent.com/tt-vscode-toolkit/lessons/qb2-local-agents/ The visualizer is tt-toplike: https://docs.tenstorrent.com/tt-toplike Time is condensed in multiple segments.
tt-toplike v0.5.0 fixes border and centering issues for consistent layout across terminal sizes and introduces a --mock N shorthand for quickly launching a mocked view with N simulated devices — handy for development and demonstrations without requiring physical Tenstorrent hardware.
tt-local-generator v0.2.2 refreshes the homepage and expands the built-in prompting bank, making it easier to discover and use the range of generation capabilities available through this local AI toolkit running on Tenstorrent hardware.
tt-local-generator v0.2.1 fixes environment variable handling in the Debian package configuration, ensuring the installer correctly reads and applies user-provided settings on first install.
tt-toplike v0.4.3 refines the CI build pipeline with improved dependency handling for Debian package builds, smoothing the release process for this terminal monitoring tool for Tenstorrent hardware.
tt-toplike v0.4.2 establishes proper GitHub Actions release and CI workflows, putting this terminal hardware monitor on a solid automated foundation for consistent, tested releases going forward.
tt-inference-server v0.13.0 marks the first release of Forge-based models — ResNet-50, VoVNet, MobileNetV2, SegFormer, and ViT — powered by the newly released tt-forge library 1.0.0. Multiple models graduate from EXPERIMENTAL to FUNCTIONAL or COMPLETE status on N150/N300, expanding the range of production-ready computer vision workloads on Wormhole hardware.
tt-exalens v0.3.17 is a targeted fix that corrects the send_tensix_risc_reset function's keyword argument to match current UMD bindings, restoring RISC reset functionality that had broken with recent UMD API changes.
tt-smi v5.0.1 adds a GitHub Actions workflow for building PyInstaller standalone binaries, laying the foundation for distributing tt-smi as a self-contained executable that works on machines without a full Python environment.
Each chip compiles independently of the others. The footage is time-compressed to fit within a minute and covers about 14 minutes of compilation. Find the tt-forge-compiletron here: https://github.com/tsingletaryTT/tt-forge-compiletron
tt-exalens v0.3.16 resolves intermittent CI failures caused by an underlying UMD bug, restoring reliable test runs and keeping the development pipeline green.
Qwen3-32B powers three interconnected agent demos. Learn how to run them on your own: https://docs.tenstorrent.com/tt-vscode-toolkit/lessons/qb2-local-agents/ The visualizer is tt-toplike: https://docs.tenstorrent.com/tt-toplike
TT System Firmware v19.9.0 delivers significant power and performance improvements for Wormhole: DRAM low power mode gets targeted fixes for slow-retraining instances, Tensix RISCs are now properly reset before entering clock-gating state, and PCIe low power mode is temporarily disabled while unresolved issues are addressed. Fan telemetry gains RPM reporting and a corrected speed percentage, and Blackhole gains support for reading VDD_MIN/VDD_MAX from the firmware table.
TT Studio v2.5.0 is a major capability release: the model catalog now syncs from the TT Inference Server artifact and covers LLM, TTS, STT, and VLM model types with per-device compatibility indicators. A new ChipSlotAllocator handles automatic chip assignment across deployments, a pulsing progress bar replaces the static deployment wait screen, and a dedicated Docker Control Service improves security by replacing direct socket access — with deployment timeouts extended to 5 hours for large model downloads.
tt-flash v3.6.5 fixes the reset path for revision C Galaxy systems, ensuring IPMI reset is always used — an important correction for safe firmware flashing on the latest Galaxy hardware.
This release upgrades to Linux Kernel 7.0 and adds device tree support for performance monitoring, letting you track system metrics directly from userspace. The kernel now ships with a patch that eliminates spurious swiotlb errors during boot, cleaning up initialization logs and making debugging easier. Several fixes to the console tool and build system round out the update.
tt-bh-linux v0.11 updates to Kernel 7.0 and adds device tree nodes for Linux performance monitoring, enabling perf tooling on Blackhole hardware. A kernel patch eliminates swiotlb initialization errors that appeared on boot, and the console tool and Makefile receive miscellaneous fixes.
tt-flash v3.6.4 wraps IPMI reset operations with the required USER_RESET ioctl before and POST_RESET ioctl after, ensuring the complete reset sequence is honored — a prerequisite for safe firmware flashing on Galaxy systems alongside the new tt-kmd 2.8.0 driver.
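The ordering described above can be sketched in Python as a bracketed call sequence. The callables `notify_driver` and `ipmi_reset` are hypothetical stand-ins, not tt-flash or tt-kmd APIs, and the sketch says nothing about how the real tool behaves if the IPMI reset fails.

```python
def galaxy_reset(notify_driver, ipmi_reset):
    """Run an IPMI reset bracketed by the two driver notifications.

    notify_driver and ipmi_reset are hypothetical callables standing in for
    the ioctl wrapper and the IPMI command.
    """
    notify_driver("USER_RESET")   # tell the driver a reset is about to happen
    ipmi_reset()                  # perform the reset itself
    notify_driver("POST_RESET")   # tell the driver the reset has completed


# Record the order of operations with simple stand-ins.
events = []
galaxy_reset(events.append, lambda: events.append("IPMI_RESET"))
```

Running the stand-ins records the three steps in the order the release note describes: notification before, reset, notification after.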
tt-flash v3.6.3 reorganizes Wormhole-specific flash logic into its own module, adds the ability to download firmware directly from tt-system-firmware GitHub release assets with local caching, and introduces a guard that prevents P300 flashing unless both chips are detected — avoiding partial flash scenarios that could leave a dual-chip card in an inconsistent state.
tt-inference-server v0.10.1 is a targeted reliability fix: a deadlock in the media server that could hang inference under certain conditions is resolved, and the trace region size for Llama-3.1-70B models is increased to prevent trace buffer exhaustion during long-running sessions.
The PCIe kernel driver gains stable, topology-derived device ordinals for Galaxy systems, replacing probe-order numbering that could vary across reboots. Hotplug processing is suppressed on Galaxy to prevent PCIe link speed degradation, and a DMA bug affecting Thunderbolt-connected devices is fixed — Galaxy users should also upgrade to tt-smi v5.0.0 and tt-flash v3.6.4 alongside this driver.
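As a rough illustration of why probe-order numbering is unstable and a derived ordering is not, the sketch below assigns ordinals by sorting PCI addresses. Sorting by domain/bus/device/function is only a stand-in assumption here; the driver's actual topology-derived scheme is not described in this summary, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def parse_bdf(address: str) -> Tuple[int, int, int, int]:
    """Split a PCI address like '0000:41:00.0' into integer fields."""
    domain, bus, rest = address.split(":")
    device, function = rest.split(".")
    return (int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))


def stable_ordinals(probed: Iterable[str]) -> Dict[str, int]:
    """Assign ordinals from the sorted addresses, not from probe order.

    Assumption: physical position is reflected by the PCI address, so the
    same card always receives the same ordinal regardless of probe order.
    """
    ordered = sorted(probed, key=parse_bdf)
    return {address: ordinal for ordinal, address in enumerate(ordered)}


# Two boots that probe the same devices in a different order.
boot_a = ["0000:c1:00.0", "0000:41:00.0", "0000:81:00.0"]
boot_b = ["0000:81:00.0", "0000:c1:00.0", "0000:41:00.0"]
ordinals_a = stable_ordinals(boot_a)
ordinals_b = stable_ordinals(boot_b)
```

Both boots yield the same mapping, which is the property the release note describes.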
tt-inference-server v0.12.0 adds experimental support for DeepSeek-R1-0528 and OpenAI's gpt-oss-120b on Wormhole Galaxy and Dual-Galaxy configurations, establishing early foundations for running frontier-scale models on Tenstorrent hardware. Both carry experimental status (R1-0528 can hang during decode; gpt-oss-120b occasionally produces incomplete special tokens), alongside updated multi-host documentation.
tt-smi v5.0.0 implements the full Galaxy reset sequence (USER_RESET before IPMI reset, POST_RESET after) required by the new tt-kmd 2.8.0 driver — Galaxy users should upgrade to this version alongside the kernel driver update. The frontend formatting logic is refactored into its own module, and legacy glx_reset_tray support is removed.
luwen v0.8.5 adds support for the PCIE_LINK_SPEED power flag and gracefully handles EINVAL errors from the set_power_state ioctl when running against older firmware that doesn't support the new power state interface — ensuring backward compatibility with older firmware versions across the Tenstorrent hardware stack.
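The backward-compatibility pattern described above can be sketched as follows. This is a generic illustration in Python, not luwen's Rust code: `try_set_power_flag` and the `do_ioctl` callable are hypothetical names, and the only behavior taken from the release note is that EINVAL from older firmware is tolerated rather than treated as a failure.

```python
import errno
from typing import Callable


def try_set_power_flag(do_ioctl: Callable[[], None]) -> bool:
    """Run a power-state request, tolerating firmware that rejects it.

    Returns True if the request was accepted and False if the kernel
    reported EINVAL (assumed here to mean the interface is unsupported).
    Any other OSError is re-raised because it signals a real failure.
    """
    try:
        do_ioctl()
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            return False
        raise
    return True


def _old_firmware() -> None:
    # Hypothetical stand-in for an ioctl call rejected by older firmware.
    raise OSError(errno.EINVAL, "Invalid argument")


supported = try_set_power_flag(lambda: None)
unsupported = try_set_power_flag(_old_firmware)
```

Only EINVAL is swallowed, so genuine I/O errors still surface to the caller.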
luwen v0.8.3 delivers a complete overhaul of the CI pipeline, establishing a robust automated testing foundation for this Rust-based hardware abstraction library that underpins much of the Tenstorrent software stack.
TT Studio v2.4.1 replaces the tt-inference-server submodule with a cleaner artifact-based integration supporting configurable branches and release versions with automatic re-download on change. Time to First Token (TTFT) and Time Per Output Token (TPOT) metrics arrive for live performance monitoring during inference streaming, alongside a dedicated Docker Control Service that replaces direct socket access with a more secure abstraction.
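For readers unfamiliar with the two metrics, here is a minimal sketch of how TTFT and TPOT are commonly computed from token arrival timestamps. The names `StreamMetrics` and `stream_metrics` are hypothetical, and TT Studio's own calculation may differ in detail.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float  # Time to First Token, in seconds
    tpot_s: float  # Time Per Output Token, in seconds


def stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Compute TTFT and TPOT from a request start time and token arrival times.

    TTFT is the delay until the first token arrives. TPOT is the average gap
    between consecutive tokens after the first one; with a single token there
    is no gap to average, so it is reported as 0.0.
    """
    if not token_times:
        raise ValueError("at least one token timestamp is required")
    ttft = token_times[0] - request_start
    count = len(token_times)
    tpot = (token_times[-1] - token_times[0]) / (count - 1) if count > 1 else 0.0
    return StreamMetrics(ttft_s=ttft, tpot_s=tpot)


metrics = stream_metrics(10.0, [10.5, 10.6, 10.7])
```

In a streaming UI the same calculation can be re-run as each token arrives to keep the display live.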
luwen v0.8.2 fixes a sign error in 16.16 fixed-point temperature decoding that could produce incorrect readings for cold hardware, adds Blackhole support for boards with multiple bootfs headers, and skips unnecessary DRAM initialization waiting on Wormhole — alongside several board ID debug and flash diagnostic utilities.
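To show how a sign error in 16.16 fixed-point decoding affects sub-zero readings, the sketch below decodes a raw 32-bit word as a signed value. It assumes a two's-complement encoding, which the release note does not confirm, and `decode_fixed_16_16` is a hypothetical name rather than luwen's API.

```python
def decode_fixed_16_16(raw: int) -> float:
    """Decode a 32-bit word holding a signed 16.16 fixed-point number.

    Assumption: the value is two's complement, so the top bit is the sign.
    Skipping the sign step would turn small negative temperatures into
    very large positive ones.
    """
    raw &= 0xFFFFFFFF          # keep only the low 32 bits
    if raw & 0x80000000:       # sign bit set: value is negative
        raw -= 1 << 32
    return raw / 65536.0       # 16 fractional bits


warm = decode_fixed_16_16(0x00190000)   # 25.0
cold = decode_fixed_16_16(0xFFF60000)   # -10.0
```

Treating `0xFFF60000` as unsigned would instead decode to roughly 65526, which is the kind of incorrect cold-hardware reading the fix targets.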
tt-umd v0.9.4 delivers focused infrastructure improvements: TopologyDiscoveryOptions is refactored for cleaner configuration, Ethernet link retraining on 6U chassis gets dedicated support, DRAM retrain capability is added, and the DeviceProtocol interface evolves to better accommodate the TTSim path. Available as DEB/RPM packages for Ubuntu 22.04/24.04, Fedora 39, and Python wheels for Python 3.9–3.13.
tt-installer v2.2.1 updates the firmware download URL to the new endpoint, ensuring the one-line installer correctly fetches the latest firmware during fresh Tenstorrent hardware setup.
tt-installer v2.2.0 adds optional Forge container installation to the setup flow — now you can set up a complete tt-forge development environment as part of the initial hardware installation process, with the Forge container defaulting to off for users who don't need it.
This release fixes a critical SIGBUS crash that could occur after device reset by reordering the chip re-detection sequence—a subtle but important ordering issue that was causing memory access violations. If you've encountered hangs or crashes when resetting devices in your workflows, this update should resolve those stability problems.
This release continues hardening the UMD stack with improved stability and performance across multiple fronts. Key changes include moving TLB configuration back to userspace to eliminate kernel-mediated performance penalties, implementing full SPI support for both Wormhole and Blackhole architectures, and fixing a subtle hash collision bug in coordinate pairs that was causing kernel cache corruption. The team also shipped safer device read/write APIs to protect against SIGBUS errors, updated device filtering to use stable logical IDs instead of volatile PCI IDs, and resolved numerous static analysis warnings to improve code quality. Python wheels now support Python 3.9–3.13 on modern Linux distributions, and system packages are available for Ubuntu 22.04/24.04 and Fedora 39.
This release brings memory and stability improvements across Tenstorrent's platform lineup, with Wormhole now supporting Samsung GDDR6 and both platforms implementing security wipes during initialization. On Blackhole, the p300 gains critical fixes for reboot reliability and expanded board support with Galaxy revC, while p300c power limits have been tuned (550W board power, 125W TDP, 88°C GDDR thermals). The telemetry layer also gets new visibility into AICLK arbitration decisions with granular tracing, helping developers debug frequency scaling behavior more effectively.
This release brings the Linux stack for Grayskull and Wormhole to kernel 6.19 with OpenSBI 1.7, leveraging upstreamed device tree support that's now part of the mainline kernel. The update improves compatibility with newer kernel mode driver and firmware versions, particularly around power management capabilities, and adds proper multi-card system support, a key requirement for deployments like TT-QuietBox that span multiple accelerators.
The PCIe driver now automatically invalidates userspace memory mappings during device resets, ensuring applications see faults rather than reaching undefined device state—though this means you'll need to close and reopen device files after a reset if your code was relying on the previous behavior. New features include blocking lock acquisition for multi-process coordination, standardized ERISC lock indices, and sysfs telemetry for firmware heartbeat and thermal shutdown events. The release also patches a kernel quirk affecting Blackhole Galaxy link speeds on Linux 6.5–6.12 and fixes several race conditions in device removal, secondary bus resets, and suspend/resume handling.
This patch release fixes snapshot mode to generate valid JSON output, addressing a parsing issue that would have affected downstream tooling. The update also introduces loop snapshot mode for continuous monitoring and extends snapshot data to capture process-level details and encode/decode statistics, giving users more granular visibility into system activity.
The Zephyr Platforms firmware v19.5.0 brings stability improvements across Wormhole and Blackhole platforms, alongside a significant change to Blackhole's core configuration. Starting with this release, all Blackhole p150 cards now report 120 Tensix cores instead of 140, unifying the interface across software layers—expect roughly 1–2% performance variance and potential grid layout changes in metal. On the connectivity side, Wormhole's ERISC firmware gets refined link training and signal integrity adjustments, while Blackhole gains Manual EQ training, improved SerDes initialization sequencing, and expanded telemetry for clock arbitration and power monitoring. Be sure to upgrade tt-flash to 3.6.0 or later to avoid losing board IDs on Blackhole during the update.
This release introduces mesh_v2 layout support and multi-host programming capabilities to tt-topology, enabling developers to work with expanded device configurations across multiple hosts. The changes include coordinate mapping adjustments for the new layout option and standardized constant formatting to align with the rest of the codebase, making it easier to integrate multi-host setups into existing Tenstorrent workflows.
This release aligns tt-topology with the latest SMI (System Management Interface) compatibility by bumping pyluwen and tools-common dependencies to version 72. The update ensures consistent versioning across the Tenstorrent toolchain, which helps maintain stability when coordinating with other system management components. These are primarily maintenance changes that keep the topology layer in sync with broader platform improvements.
UMD v0.9.1 is now available on PyPI, making it simpler to install the Python bindings across Python 3.9–3.13 on modern Linux distributions. Beyond the PyPI publication, this release includes fixes for 6U device reset performance (reducing warm reset time from ~6 minutes to ~1 minute), improved ARC core startup reliability with better timeout handling, and a multiprocessing test to validate shared lock behavior across forked processes. Compiled artifacts, system packages for Ubuntu 22.04/24.04 and Fedora 39, and updated topology discovery logic round out the update.
This bugfix release corrects a memory reporting issue that was causing inaccurate available device memory readouts on NVIDIA GPUs, and folds effective utilization metrics directly into the GPU utilization display so you get a clearer picture of actual workload efficiency without extra navigation.
Syllo/nvtop 3.3.0 expands hardware coverage to Rockchip NPUs, MetaX GPUs, and Enflame GCUs while refining monitoring capabilities across existing platforms. The release introduces an effective load metric that weights utilization by power consumption, separates GPU and memory clock reporting, and improves one-shot mode output—alongside numerous fixes for Mali, Intel Battlemage, AMD integrated graphics, and Nvidia unified memory tracking. If you're monitoring accelerators beyond the usual suspects or need more granular power-aware metrics, this update brings meaningful improvements to the toolkit.
This release broadens installer compatibility and adds new installation options for more flexible Tenstorrent setups. The distro detection now handles derived Linux distributions like Linux Mint by checking the ID_LIKE field, while a new --use-uv flag offers faster Python package installation as an alternative to pip. You can now choose to install Docker as a container runtime option alongside existing choices, and a new "default mode" simplifies the tt-studio installation flow for users who want a streamlined setup experience.
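The ID_LIKE fallback can be sketched in a few lines. This is an illustration in Python rather than the installer's own shell code; `parse_os_release`, `distro_family`, and the list of supported families are assumptions made for the example.

```python
import shlex
from typing import Dict, Optional, Sequence


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse os-release style KEY=value lines, stripping shell quotes."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        parts = shlex.split(value)
        fields[key] = parts[0] if parts else ""
    return fields


def distro_family(
    fields: Dict[str, str],
    supported: Sequence[str] = ("ubuntu", "debian", "fedora"),
) -> Optional[str]:
    """Return the first supported name among ID and the ID_LIKE entries."""
    candidates = [fields.get("ID", "")] + fields.get("ID_LIKE", "").split()
    for name in candidates:
        if name in supported:
            return name
    return None


mint = parse_os_release('NAME="Linux Mint"\nID=linuxmint\nID_LIKE="ubuntu debian"\n')
family = distro_family(mint)
```

Checking ID first keeps exact matches authoritative, and ID_LIKE is consulted only when the distribution itself is not on the supported list.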
This patch release restores backward compatibility for Blackhole's READ_{TS,PD,VM} message interfaces, reverting changes that shipped in v19.1.0 and would have broken existing code. Additionally, the team disabled MRISC PHY powerdown on BH Galaxy after discovering it caused data errors, prioritizing stability over the power savings the feature was intended to provide.
This patch reverts the Wormhole ERISC firmware to version 6.7.2.0 after 6.7.3.0 introduced ETH training instability in WH Galaxy systems. If you've been experiencing unreliable Ethernet initialization on Wormhole hardware, this update should restore the stability you had before.
This release focuses on streamlining the CI/CD pipeline for publishing, particularly around TestPyPI uploads and artifact handling. The changes ensure that the release process skips redundant artifact uploads, removes unnecessary source distributions during testing phases, and properly sequences the release generation step after TestPyPI publication. These refinements should make the release workflow more efficient and reduce friction for maintainers pushing updates to the ecosystem.
This release tackles stability across multiple fronts: Wormhole firmware gets link training fixes for active cables and improved signal integrity on long traces, while GDDR latency tuning and a Galaxy datarate rollback address recent regressions. On the driver side, SMBus transaction handling now properly respects cancel state, fixing PCIe enumeration issues that users were seeing. The BH ARC library adds cleaner power management tracking for AICLK state and better structured access to system messages, plus some developer conveniences like flash erase support in the bootstrap tool and a version update script to streamline releases.
This release fixes multi-host topology configuration by removing a single-host assumption in the connection map generation, which should help users with distributed setups properly discover and configure their hardware. The board type detection has also been expanded to recognize newer Tenstorrent boards, and tt-tools-common has been bumped to 1.4.33 to pull in any upstream improvements.
This release significantly streamlines the installer by removing legacy installation paths and consolidating the codebase around repository-based package management. Ubuntu 20 support is now fully deprecated, backwards-compatible environment variables have been removed, and manual installation workflows have been phased out in favor of a cleaner argument-parsing approach. The changes also add distribution-aware package manager detection and improve handling of Debian systems specifically, while fixing various shell script issues and ensuring the inference server is properly invoked during installation.
The loader and scheduler in RiescueD have been refactored with explicit interfaces that make their control flow clearer and reduce redundant checks—test_setup now runs once during scheduler initialization rather than on every loop iteration. Bug fixes address deterministic CSR ordering, privilege mode handling in virtualized environments, and stack allocation for algorithm tests, while new documentation with Mermaid flowcharts and API details should help developers understand and debug runtime behavior more easily.
This release expands tt-installer's reach and capabilities by adding Fedora package support through a hosted repository and introducing an install-inference-server function for streamlined deployment of inference workloads. If you're running Fedora or planning to set up inference services, these additions make the installation process more straightforward and reduce platform-specific friction.
Riescue now gives you fine-grained control over test execution with a dedicated End of Test (EOT) module and hooks that let you inject custom logic into the runtime environment. The new Conf library lets you handle complex configuration scenarios in Python while staying within the CLI workflow, making it easier to tailor test sequences without wrestling with command-line arguments. Updated documentation covers all three features, so you can start customizing your test lifecycle right away.
This release strengthens Riescue's reliability and usability with Hart-local storage in the RiescueD runtime that prevents register corruption from unexpected traps—a key fix for stable low-level execution. Documentation improvements now cover the RiescueD runtime environment and link directly to tensor parallelism tutorials, while mode-specific fixes in RiescueC and early standardization work on the test scheduler lay groundwork for more consistent interfaces across the stack.
This release updates tt-topology to handle Xenium reset schema changes via a tools-common bump, while adding better device detection robustness—the tool now exits gracefully when no supported devices are found rather than proceeding with incomplete configuration. A couple of test fixes round out the changes, keeping the codebase clean as the topology layer evolves alongside hardware updates.
Thaddeus Fortenberry, VP of Robotics + Automotive at Tenstorrent, discusses how we're paving the open road to chiplet interoperability in robotics. Open Chiplet Atlas | https://www.openchipletatlas.org Tenstorrent Robotics IP | https://tenstorrent.com/ip X https://twitter.com/tenstorrent Discord https://discord.com/invite/tenstorrent
Watch Jim Keller and Miles Dooley kick off Tenstorrent's IP launch event in San Francisco, TT-Blueprint. Tenstorrent IP | https://tenstorrent.com/ip RISC-V CPU | https://tenstorrent.com/ip/risc-v-cpu Open Chiplet Atlas | https://www.openchipletatlas.org/ X https://twitter.com/tenstorrent Discord https://discord.com/invite/tenstorrent
Wei-Han Lien, Chief Architect at Tenstorrent, discusses how Tenstorrent is pushing the chiplet ecosystem forward with Open Chiplet Atlas. Open Chiplet Atlas | https://www.openchipletatlas.org/ Tenstorrent IP | https://tenstorrent.com/ip RISC-V CPU | https://tenstorrent.com/ip/risc-v-cpu X https://twitter.com/tenstorrent Discord https://discord.com/invite/tenstorrent
Aniket Saha, VP of Product Strategy at Tenstorrent, discusses Tenstorrent's new, open business model for high-performance IP and why the old, closed-ecosystem approach isn't working. Tenstorrent IP | https://tenstorrent.com/ip X https://twitter.com/tenstorrent Discord https://discord.com/invite/tenstorrent
Riescue v1.1.2 adds better error handling for runtime internal failures with new loader_panic and trap_handler_panic mechanisms, making it easier to catch and debug issues when things go wrong during execution. The release also includes pyright type-checking fixes and documentation updates that should improve the developer experience overall.
This release brings the Blackhole platform up to Kernel 6.17 and OpenSBI 1.7, with the kernel patches now aligned with upstream RISC-V contributions. Note that the device tree filename has changed from blackhole-p100.dtb to blackhole-card.dtb, so you'll need to update any boot configurations accordingly. The userspace application has also been updated to handle the multi-descriptor packet format that virtio-net now uses in 6.17, ensuring network virtualization works smoothly on the updated stack.
tt-studio v2.1.0 brings meaningful improvements to agent capabilities and operational reliability, with automatic LLM discovery and health monitoring making multi-model setups more robust, plus a code interpreter tool expanding what agents can do. The release tackles several critical issues including a tt-smi blocking problem that was slowing backend execution, and adds new diagnostics endpoints for API information and service logs that help developers understand what's happening under the hood. UI polish comes through auto-refreshing health checks, expandable model details, and better dark mode support, while infrastructure upgrades to Node.js 22 and Vite 6.3.5 keep the foundation current. For teams running tt-studio in production, the health check improvements and process cleanup fixes should reduce operational friction considerably.
The virtio device implementation and X280 kernel interrupt handler get more robust with this update, addressing edge cases and reliability concerns identified in the codebase. If you're running workloads on Blackhole hardware, the improved PLIC handler should make interrupt handling more predictable, and the virtio changes reduce the surface area for device communication failures. Updated disk images and kernel binaries are ready to download if you want to test these improvements.
New release: tt-torch 0.4.0
The Debian image now includes ca-certificates, which fixes HTTPS connectivity issues that were blocking proper SSL/TLS verification in the environment. This is a foundational fix that ensures secure communication works as expected when building and deploying applications on the Blackhole platform.
This patch release polishes the user experience with a focus on visual clarity and accessibility—the team cleaned up a confusing close button in the UI and fixed error text that was hard to read in light mode, while also refining theme consistency across the application. The improvements reflect community feedback, including a first-time contribution that improved the overall interface clarity. If you've been working with Studio and noticed visual quirks switching between light and dark modes, this update should smooth out those rough edges.
The team has published a white paper on an automated approach to generating architectural coverage models from an instruction set simulator, addressing the challenge of verifying increasingly complex ISA extensions. This framework bridges ISS-only simulations and RTL co-simulation environments, allowing coverage sampling to scale alongside ISA growth without manual effort. For verification engineers, this means a systematic path to comprehensive coverage that adapts across different simulation contexts rather than reinventing verification methodology with each architecture iteration.
This session, led by Tapasvi Patel, Sr. Engineer at Tenstorrent, covers how Tenstorrent's compiler (TT-Forge) and runtime (TT-NN) manage device meshes, tensor sharding, and collective communications, with a focus on using JAX. Also, see how Shardy helps with automatic parallelization. 0:00 Introduction 0:25 Modeling multi-device in Tenstorrent's compiler & runtime 5:33 Parallelism strategies and multi-chip hardware 8:38 Core TT-NN runtime features for multi-device (meshes, tensors, mappers)…
Heterogeneous GPU infrastructures present a binary compatibility challenge: code compiled for one vendor's GPU will not run on another due to divergent instruction sets, execution models, and driver stacks. We propose hetGPU, a new system comprising a compiler, runtime, and abstraction layer that together enable a single GPU binary to execute on NVIDIA, AMD, Intel, and Tenstorrent hardware. The hetGPU compiler emits an architecture-agnostic GPU intermediate representation (IR) and inserts…
With the rapid development of artificial intelligence (AI) applications, an emerging class of AI accelerators, termed Inter-core Connected Neural Processing Units (NPU), has been adopted in both cloud and edge computing environments, like Graphcore IPU, Tenstorrent, etc. Despite their innovative design, these NPUs often demand substantial hardware resources, leading to suboptimal resource utilization due to the imbalance of hardware requirements across various tasks. To address this issue,…
TT-Forge is Tenstorrent’s MLIR-based compiler. This Q&A covers its architecture, front-end support (Torch, JAX), tools like TT-Explorer, and key discussions on runtime customization, debugging, production timelines, quantization, model training, the PiKernel DSL, and how to contribute. 0:00 What is TT-Forge? 1:09 Accessing Tenstorrent hardware on Koyeb 4:39 Compiler production timeline and manual model implementation 7:24 TT-Explorer for quantization schemes 8:26 Model training framework with…
Tenstorrent Developer Day Livestream April 3rd at 10 am PT
TT-Forge is Tenstorrent’s MLIR-based compiler. Learn how TT-Forge integrates with our AI software stack, why we’re building on MLIR, and the features that make TT-Forge flexible and adaptable. 0:00 Introduction 1:36 Integration with key ML frameworks 2:05 Why MLIR? 3:10 Driving principles of TT-Forge 7:48 Recap tt-forge: https://github.com/tenstorrent/tt-forge tt-mlir: https://github.com/tenstorrent/tt-mlir Follow Tenstorrent on X at https://x.com/tenstorrent Join our Discord at…
nvtop 3.2.0 expands hardware coverage significantly with support for Intel XE GPUs, Broadcom V3D accelerators (Raspberry Pi), and Google TPUs, making it a more versatile monitoring tool across different compute platforms. Beyond new hardware, the release adds practical features like JSON snapshot exports and a process-list hiding option for cleaner monitoring views, while improving existing Intel i915 support and backend stability. The conda-forge availability makes installation simpler for users in that ecosystem.
This release brings support for ten additional model variants—including Phi2, Qwen1.5-0.5B, and YOLOX—along with initial Wormhole n300 dual-chip support for the TT-LoudBox and TT-QuietBox systems. Performance improvements are mixed across architectures: Grayskull sees solid gains on HRNet (41%), while Wormhole n300 single-chip shows 14% uplift on Falcon-7B, though some models like FLAN-T5 have regressed. CNN models now work on 4-chip and 8-chip MMIO configurations, and the compiler received stability fixes alongside improved documentation.
Quite pleasing to hear Jim admit he had wrong assumptions about how difficult this was going to be. I genuinely can't fathom a piece of hardware that can actually run tensor operations concurrently. Not sure how to design it or invent it. Cool idea though! https://m.youtube.com/watch?v=cy-9Jl666Aw&t=2s submitted by /u/ronalurker777
submitted by /u/brand_momentum
This release expands model coverage with five new variants including SSD300 ResNet50 and multiple YOLOv6 configurations, while introducing initial support for Wormhole n300 hardware across single-chip and 4-chip setups. Performance gains are mixed—Wormhole n150 shows broad 6% improvements with standouts like MobileNetV2 (46%) and YOLOv5 (20%), though Grayskull sees some notable regressions that the team plans to address—alongside general compiler stability improvements and better documentation. Known issues around Whisper, Stable Diffusion, and ethernet hangs on dual-chip n300 data parallel configurations are flagged for upcoming patches.
https://arxiv.org/abs/2406.02528 submitted by /u/ronalurker777
This alpha release expands model coverage with 23 new variants, including DLA architectures across ONNX and both image classification and semantic segmentation versions of SegFormer and PerceiverIO in PyTorch. The broader set of supported models should help developers test more diverse workloads on Tenstorrent hardware, though general compiler stability improvements across the board suggest the team has been focused on solidifying the foundation as well. Keep in mind this is still alpha software, so mileage may vary depending on your specific use case.
This alpha release brings support for the full DLA (Deep Layer Aggregation) model family across ten variants, expanding the range of architectures you can target with tt-buda. Beyond the new models, the team has focused on compiler stability with targeted bug fixes and upgraded the bundled diffusers library to v0.27.2 for better compatibility with recent generative model implementations. As always with alpha releases, while the core improvements are solid, you should test thoroughly before relying on this for production workloads.
This release brings meaningful performance gains across both Grayskull and Wormhole hardware, with Wormhole seeing particularly strong results (52% average improvement with no regressions) and Grayskull hitting 30% average gains, though a few models like FLAN-T5 saw slowdowns that may warrant investigation. On the feature side, there's new support for Perceiver IO and HarDNet architectures, Ubuntu 22.04 is now the default OS, and the compiler picked up various stability fixes including better handling of multi-user workloads—all of which should make the platform more robust for production deployments. DeBuda is now available as a standalone wheel, and documentation improvements should help new users get up to speed faster.
nvtop now covers a broader range of accelerators with added support for Adreno GPUs via the msm driver, Apple Silicon GPUs (M1/M2), and Huawei Ascend accelerators, making it more useful across heterogeneous hardware environments. The release also fixes a crash related to configuration file discovery and prevents unnecessary handler calls for disabled devices, improving stability and efficiency for users managing diverse GPU setups.
submitted by /u/bearacorn
submitted by /u/JoesRevenge2