Fun Demos

There’s a specific moment that happens when someone skeptical about specialized hardware sees tt-toplike running. Their face shifts. The demos in this chapter produce that moment reliably. They’re not benchmarks. They’re invitations — to look, to question, to want to understand what the machine is actually doing.

Four demos. Each one stands alone. All of them run on your QB2 today.

Demo 1: Arcade Mode

Open a terminal. Start a model serving in another session, or just leave the hardware idle. Then run:

tt-toplike --mode arcade

The screen fills. A hero character moves with chip telemetry — position, speed, direction all derived from actual hardware readings. AICLK frequency, power draw, thermal state. The game is the monitor. The monitor is the game.

This is the first thing to show anyone who claims hardware monitoring is inherently boring. It isn’t. It just needs better defaults.

To exit: q or Ctrl-C.

The demo lands harder when the chips are busy. Start a model in one terminal, then open arcade mode in another. The activity you see reflects real computation.

Install tt-toplike if it isn’t already present:

# tt-toplike is not in the Tenstorrent apt PPA — install from GitHub releases or via cargo:
# https://github.com/tenstorrent/tt-toplike/releases
sudo dpkg -i tt-toplike_*.deb
# Or: cargo install tt-toplike

Demo 2: Flow Mode

Where arcade mode is expressive, flow mode is accurate-expressive. Run:

tt-toplike --mode flow

Particle streams trace the NOC — Tenstorrent’s Network on Chip. During inference, data moves from the DRAM perimeter into the compute cores and back out again, each hop a real transaction on a real fabric. Flow mode makes those streams visible as animated particles.

Watch what happens when you start or stop inference. The particle density changes. The path patterns change. You’re watching the chip’s actual communication graph in motion.

This one tends to generate the most questions. “What are those things?” is how good conversations start.

tt-toplike memory-flow mode — DRAM channels animated across the four Blackhole chips
tt-toplike flow mode — particle streams trace data moving between DRAM and compute cores in real time

Demo 3: AI Video Generation — Live Generative Art

tt-local-generator is a GTK4 desktop app that runs video generation entirely locally, no API key, no cloud dependency. The Wan2.2 text-to-video model produces 480×832 clips using all four Blackhole chips. Each clip takes roughly six minutes.

Set up a continuous generation loop and you have a generative art installation:

# Install tt-local-generator from GitHub releases (not in the Tenstorrent apt PPA)
# https://github.com/tenstorrent/tt-local-generator/releases
sudo dpkg -i tt-local-generator_*.deb

# Launch the app
tt-local-generator

In the app, open the video generation panel. Write a prompt. Let it run. The app has an “attractor mode” that generates clips continuously and plays them fullscreen. Walk away. Come back to a wall of generated cinema.

For a polished installation setup — fullscreen display, auto-start on login, continuous prompts — see the QB2 video generation lesson.

The AnimateDiff integration in tt-local-generator also runs natively on QB2. Shorter clips, different aesthetic, same local-only principle. See tt-animatediff for the standalone library.

Demo 4: Local 70B with No Internet Required

Four chips. A language model with 70 billion parameters. No API key. No latency spike from a datacenter on another continent.

The fastest path is through tt-inference-server, which handles the Docker container and weight caching automatically:

docker run \
  --env "HF_TOKEN=$HF_TOKEN" \
  --ipc host \
  --publish 8000:8000 \
  --device /dev/tenstorrent \
  --mount type=bind,src=/dev/hugepages-1G,dst=/dev/hugepages-1G \
  --volume volume_id_Llama-3.3-70B-Instruct:/home/container_app_user/cache_root \
  ghcr.io/tenstorrent/tt-inference-server/vllm-tt-metal-src-release-ubuntu-22.04-amd64:0.10.1-555f240-22be241 \
  --model Llama-3.3-70B-Instruct \
  --tt-device p300x2

Wait for Application startup complete — first run downloads 140 GB of weights, so plan ahead. Then ask it something that requires reasoning:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.3-70B-Instruct",
    "messages": [{"role": "user", "content": "Explain the tradeoffs between data-parallel and model-parallel inference for large language models. Be specific about memory and latency."}]
  }' | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['choices'][0]['message']['content'])"

The response comes from silicon in your own office. The model that required a specialized cloud service a year ago runs on hardware you own.

That’s the demo.

For the full walkthrough — prerequisites, HuggingFace token setup, the DeepSeek reasoning model variant, and troubleshooting — see Running Llama-3.3-70B on QB2.

⬡ Tensix Grid — Blackhole (P100/P150/P300c / QB2)

One Blackhole chip running a slice of Llama-3.1-70B. All four of yours look like this, simultaneously.

tt-toplike showing version, chip clocks, and available modes on a QB2
tt-toplike on a live QB2 — checking the tool and hardware state before launching

Demo 5: TT-Forge Compiletron — Hunt the Wild Model

The premise is simple: compile as many models as possible, score points, and try not to hit something that crashes the runtime.

tt-forge-compiletron is a roguelike model compilation game. It is also a genuine engineering survey tool. The gameplay loop runs models from the tt-forge-models zoo (200+ curated PyTorch and JAX models) through forge.compile() on your Blackhole chips. Each successful compile earns points. First-ever compile of a model earns a ×5 multiplier. The scoring system rewards breadth, rarity, and speed.

The TUI is three screens running in a Textual interface: a model queue on the left, live per-chip compilation status in the center, and a running scoreboard with ASCII art banners rendered in pyfiglet. When a model compiles successfully, the First Voice feature kicks in — a themed inference pass that prints the model’s first decoded output on Tenstorrent silicon. The first time a language model generates a token on your hardware, it announces itself with a banner.

The bestiary at data/bestiary.json grows with every session. It records compile status, timing, and failure class for every model attempted. Come back later and it knows what you’ve already conquered.

Set up and launch:

cd ~/code/tt-forge-compiletron
source ~/tt-forge-fe/env/activate
python3 expedition.py run --tui --seed-only --limit 8 --chips 4

--seed-only pulls from the curated zoo only, --limit 8 caps the session at eight models, --chips 4 assigns all four Blackhole cards. A good starting session.

First Voice is the payoff. After each successful compile, the game runs one inference pass and prints the model’s decoded output. A sentiment classifier labeling its first sentence. A ResNet identifying its first image. A GPT-2 generating its first token. Watch the chip do something real with what it just learned to run.

⬡ Four chips, four models — your QB2's two p300c cards

Four genuine Blackhole chips across two p300c cards, joined by the Samtec link. The model-zoo game compiles a different model on each, in parallel.


Next: Ubuntu Customization →