n150 n300 T3000 p100 p300c 60 min Validated

Native Video Animation with AnimateDiff

Run SD 1.4 video generation on Blackhole® — 15 seconds per frame, real images, no CPU fallback on the UNet.

This lesson walks through three paths to animated video, escalating from a CPU baseline to full Blackhole hardware acceleration with cross-frame temporal attention.

Along the way you'll learn the model bring-up methodology: how to create standalone packages that integrate with TT-Metalium without modifying the core repository.

What you'll build

These were generated on a single Blackhole p300c, 8 frames × 25 steps each:

"World's Fair 2099" · "Phosphor Horizon" · "Nebula" · "Mayan Temple"

And a full 35-GIF cosmic study across the chip's 11×10 Tensix grid — see the live showcase.

Chip grid — 35 GIFs across 110 Tensix nodes


What is AnimateDiff?

AnimateDiff adds temporal attention to Stable Diffusion 1.4 by injecting TemporalTransformer blocks into every BasicTransformerBlock in the UNet — at the 320-dim feature level where the motion weights were trained:

SD 1.4 UNet WITHOUT MotionAdapter:
  Noise → [Down blocks] → [Mid block] → [Up blocks] → Denoised latent
           each block: BasicTransformerBlock(spatial attention only)

SD 1.4 UNet WITH MotionAdapter (Phase 1):
  Noise → [Down blocks] → [Mid block] → [Up blocks] → Denoised latent
           each block: BasicTransformerBlock
                         └── spatial attention (unchanged, 320-dim)
                         └── TemporalTransformer (cross-frame, 320-dim)
                                            ↑
                               mm_sd_v15_v2.ckpt weights here

Why SD 1.4, not SD 3.5? AnimateDiff motion weights (mm_sd_v15_v2.ckpt) were trained for SD 1.5's UNet with 320-dim transformer blocks. SD 3.5 uses a DiT with 2432-dim blocks — architecturally incompatible. The diffusers AnimateDiffPipeline handles MotionAdapter injection automatically when paired with the SD 1.4 base model.


Step 1: Get the project

The tt-animatediff repo is public. Clone it directly — you own your copy:

git clone --depth 1 --branch v0.1.0 \
    https://github.com/tenstorrent/tt-animatediff.git \
    ~/tt-projects/tt-animatediff

cd ~/tt-projects/tt-animatediff
python3 -m pip install -e ".[dev]"


Project structure:

tt-animatediff/
├── animatediff_ttnn/
│   ├── pipeline.py            # Phase 1: CPU AnimateDiffPipeline wrapper
│   ├── ttnn_pipeline.py       # Phase 2/2.5: Blackhole TT-NN UNet + PNDM scheduler
│   ├── temporal_attention.py  # Phase 2.5: cross-frame self-attention
│   └── temporal_module.py     # Reference — temporal attention math
├── examples/
│   ├── generate.py            # Unified entry point (--mode cpu|blackhole|sim)
│   ├── generate_baseline.py   # Phase 1 CPU shim
│   └── generate_sim.py        # Phase 2.5 on ttsim simulator shim
├── scripts/
│   └── generate_study.py      # Batch generation (35-GIF cosmic study)
├── docs/
│   ├── INTEGRATION_GUIDE.md
│   ├── SIMULATOR.md
│   └── HARDWARE_COMPAT.md
└── tests/                     # 16 CPU/mock tests, no hardware required

Step 2: Download the models

# SD 1.4 — required for all phases
hf download CompVis/stable-diffusion-v1-4

# AnimateDiff motion adapter — Phase 1 only
hf download guoyww/animatediff-motion-adapter-v1-5-2

Step 3: Phase 1 — CPU AnimateDiffPipeline

The diffusers AnimateDiffPipeline loads SD 1.4, injects the MotionAdapter at every transformer block, and denoises all frames simultaneously with temporal attention. This is the reference implementation — correct output to compare against.


cd ~/tt-projects/tt-animatediff
python3 examples/generate.py --mode cpu \
    --prompt "aurora borealis over arctic ice, green and violet ribbons, cinematic" \
    --frames 8 --steps 25 --output output/phase1.gif

Expected: ~2 min/frame on CPU. The result, output/phase1.gif, contains 8 frames of temporally coherent animation.

What happens inside:

from diffusers import AnimateDiffPipeline, MotionAdapter

adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", motion_adapter=adapter)

# All 8 frames denoised together — temporal attention at every step
result = pipe(prompt=prompt, num_frames=8, num_inference_steps=25)
frames = result.frames[0]  # List of PIL Images

Step 4: Phase 2.5 — Blackhole + temporal attention (canonical)

Phase 2.5 replaces the PyTorch UNet with the TT-NN UNet from ~/tt-metal, running natively on Blackhole silicon. Cross-frame self-attention is applied at each PNDM step across the noise predictions for all N frames simultaneously, which gives temporal coherence while the UNet runs at hardware speed.

Requires: Blackhole hardware (p100/p150/p300c/TT-QuietBox® 2) and ~/tt-metal built.


source ~/tt-metal/python_env/bin/activate
cd ~/tt-projects/tt-animatediff

python3 examples/generate.py --mode blackhole \
    --prompt "1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, retro-futurist optimism, cinematic 4K" \
    --frames 8 --steps 25 --temporal-alpha 0.35 \
    --output output/blackhole.gif

Expected output:

AnimateDiff — Blackhole hardware (TTNN UNet + cross-frame temporal attention)
  Frames : 8  Steps: 25  Temporal alpha: 0.35

Opening Blackhole device...
Loading SD 1.4 models...
  Building TTNN UNet (~2-3 min first run, cached after)...
  Loaded in 7.3s

Generating 8 frame(s)...
  Done in 121.4s (15.2s/frame)

Saved 8 frame(s) → output/blackhole.gif

Performance on p300c: about 7 s model load plus ~15 s/frame. Kernel compilation takes ~2–3 min on the first run and is cached afterwards.
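To budget a longer run, the figures above can be turned into a rough cost model. This is a back-of-the-envelope sketch, not part of the repo: `estimate_seconds` is a hypothetical helper, it assumes the per-frame time stays constant as the frame count grows, and it ignores the one-time kernel compilation.

```python
# Hypothetical helper, not part of tt-animatediff.
# Defaults are the figures from the sample log above (p300c, 25 steps).
def estimate_seconds(frames: int, sec_per_frame: float = 15.2,
                     load_sec: float = 7.3) -> float:
    """Model load time plus a constant per-frame cost."""
    return load_sec + frames * sec_per_frame


total_8 = estimate_seconds(8)   # the 8-frame run shown above, including load
per_step = 15.2 / 25            # seconds per denoising step, per frame
cpu_speedup = 120 / 15.2        # versus the ~2 min/frame CPU baseline

print(f"8 frames: ~{total_8:.0f} s including model load")
print(f"per step: ~{per_step:.2f} s, speedup over CPU: ~{cpu_speedup:.1f}x")
```

The per-step figure is an upper bound on the time spent in the UNet, because the per-frame number also folds in the cross-frame attention pass and the CPU VAE decode.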


What we built — the improvements

The tt-animatediff repo went through several rounds of bring-up work to get to this state. Here's what was added beyond the basic TTNN UNet port:

Cross-frame temporal attention (Phase 2.5)

The original Phase 2 denoised each frame independently with only shared base noise for coherence. Phase 2.5 adds a CPU cross-frame self-attention pass at each denoising step:

For step t in [T → 0]:
    For frame i in [0, N):
        noise_preds[i] = TTNN_UNet(latent[i], t)            # Blackhole hardware
    noise_preds = cross_frame_attention(noise_preds)        # CPU, ~0ms for N=8
    For frame i in [0, N):
        latent[i] = scheduler.step(noise_preds[i], t, latent[i])

The cross-frame attention (animatediff_ttnn/temporal_attention.py) reshapes the stacked noise predictions so frames attend to each other, then blends the result back via --temporal-alpha:

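import torch
import torch.nn.functional as F

def cross_frame_attention(x: torch.Tensor, alpha: float = 0.35) -> torch.Tensor:
    # x: (N, C, H, W), one noise prediction per frame
    N, C, H, W = x.shape
    # Each spatial position becomes a batch entry; the N frames form the sequence
    flat = x.reshape(N, C, H * W).permute(2, 0, 1)  # (H*W, N, C)
    attn_out = F.scaled_dot_product_attention(flat, flat, flat)
    # Back to (N, C, H, W); reshape copes with the permuted memory layout
    attn_out = attn_out.permute(1, 2, 0).reshape(N, C, H, W)
    # alpha = 0 returns x unchanged; alpha = 1 returns the attention output
    return (1 - alpha) * x + alpha * attn_out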
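To see what the blend does, here is a dependency-free worked example for a single spatial position, written in plain Python rather than PyTorch. It sketches the same math and is not the repo's code: `attend_across_frames` and `blend` are hypothetical names, and the 1/sqrt(C) scaling mirrors the default used by scaled_dot_product_attention.

```python
import math


def attend_across_frames(values):
    """Softmax self-attention over the frame axis for one spatial position.

    values[i] is the C-dimensional noise prediction of frame i.
    """
    scale = 1.0 / math.sqrt(len(values[0]))
    out = []
    for query in values:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in values]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values)) / norm
            for c in range(len(query))
        ])
    return out


def blend(values, alpha):
    """(1 - alpha) * original + alpha * attention output, per frame."""
    attended = attend_across_frames(values)
    return [
        [(1 - alpha) * x + alpha * a for x, a in zip(row, att_row)]
        for row, att_row in zip(values, attended)
    ]


frames = [[1.0, 0.0], [0.0, 1.0]]  # two frames, C = 2
unchanged = blend(frames, 0.0)     # alpha = 0: no cross-frame mixing
mixed = blend(frames, 1.0)         # alpha = 1: frames pulled toward each other
print(unchanged)
print(mixed)
```

With alpha at 1.0 the two frames move toward their shared average, which is the "very low motion" end of the tuning range; with alpha at 0.0 the input comes back untouched.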

Hardware resilience

setup_blackhole() reads hwmon sentinel values before opening the device. If a chip shows a dead-ARC temperature (> 1,000,000 millidegrees in temp1_input), a warning is emitted. If you see Timed out while waiting for active ethernet core, run:

tt-smi -r 0 1 2 3 && sleep 8

Then retry — this clears hung ethernet cores from prior incomplete teardowns.
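A check along these lines can be sketched with the standard library. This is an illustration under assumptions, not the body of setup_blackhole(): the name `find_suspect_sensors` is hypothetical, it scans every hwmon device rather than only Tenstorrent ones, and the sysfs layout (/sys/class/hwmon/hwmon*/temp1_input) is the usual Linux convention rather than something the repo confirms.

```python
from pathlib import Path

# Threshold quoted in the text above, in millidegrees.
DEAD_ARC_THRESHOLD = 1_000_000


def find_suspect_sensors(hwmon_root: str = "/sys/class/hwmon") -> list:
    """Return temp1_input paths whose reading exceeds the dead-ARC threshold.

    Hypothetical helper: unreadable or non-numeric sensors are skipped.
    """
    suspects = []
    for sensor in sorted(Path(hwmon_root).glob("hwmon*/temp1_input")):
        try:
            reading = int(sensor.read_text().strip())
        except (OSError, ValueError):
            continue
        if reading > DEAD_ARC_THRESHOLD:
            suspects.append(str(sensor))
    return suspects


if __name__ == "__main__":
    for path in find_suspect_sensors():
        print(f"warning: implausible temperature reading at {path}")
```

Taking the root directory as a parameter keeps the check testable on machines without the hardware: point it at a temporary directory with fake sensor files.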

Unified entry point

All three modes are one script with a --mode flag:

python3 examples/generate.py --mode cpu        # any machine, ~2 min/frame
python3 examples/generate.py --mode blackhole  # Blackhole hardware, ~15 s/frame
python3 examples/generate.py --mode sim        # no hardware, ttsim virtual device

tt-metal path tracking

Between firmware 19.5.0 and 19.8.0, the SD 1.4 demo moved from models.demos.wormhole.stable_diffusion to models.demos.vision.generative.stable_diffusion.wormhole. See docs/HARDWARE_COMPAT.md for recovery steps.
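One way to tolerate a move like this is to try the candidate module paths in order. The sketch below is an assumption about how such a fallback could look, not the repo's recovery procedure: `import_first_available` is a hypothetical helper, and the two module paths are the ones named above.

```python
import importlib
from types import ModuleType

# Module paths taken from the text above, newest layout first.
SD_DEMO_CANDIDATES = (
    "models.demos.vision.generative.stable_diffusion.wormhole",
    "models.demos.wormhole.stable_diffusion",
)


def import_first_available(candidates) -> ModuleType:
    """Import and return the first module in `candidates` that exists."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError(
        "none of the candidate modules could be imported:\n" + "\n".join(errors)
    )
```

With tt-metal on the Python path, `import_first_available(SD_DEMO_CANDIDATES)` would return whichever layout is installed; if neither imports, the error lists both attempts.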


Prompt guide

SD 1.4 at 512×512 with the TTNN UNet has a distinct personality. Knowing it gets you better results.

What it does well

Category            Examples
Cosmic & abstract   nebulae, aurora, galaxies, energy fields, sacred geometry
Natural scenes      forests, oceans, deserts, fire, water, sky
Painterly styles    oil painting, watercolor, impressionism, concept art
Cinematic lighting  golden hour, neon glow, moonlight, candlelight
Architecture        temples, ruins, castles, sci-fi structures
Retro aesthetics    CRT glow, film grain, vaporwave, cyberpunk

What to avoid

Prompt patterns that work

# Style before subject — model weights the style heavily
"watercolor painting of ancient ruins at sunset, soft brushstrokes, muted palette"

# Cinematic lighting descriptors unlock quality
"cinematic 4K, dramatic side lighting, volumetric fog, depth of field"

# Cosmic + architecture is the sweet spot for this model
"Mayan pyramid under a swirling nebula, starfield, bioluminescent jungle, cinematic 4K"

# Motion-friendly subjects produce the best animation
"crackling campfire"   "ocean waves"   "swirling clouds"   "aurora borealis"
"shifting cosmos"      "flowing lava"  "drifting smoke"    "mandala blooming"

--temporal-alpha tuning

Value    Effect
0.0      No cross-frame mixing; shared noise only
0.2–0.3  Subtle coherence, natural variation
0.35     Default; good for most subjects
0.5–0.7  Strong coherence, background stabilises
1.0      Maximum blending, very low motion

Fast motion (fire, water): 0.2–0.35 · Slow drift (cosmos, aurora): 0.4–0.6
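To find a good value for a new subject, it helps to render the same prompt at several alphas and compare the GIFs side by side. The sketch below only builds the command lines, using the flags shown earlier in this lesson. `alpha_sweep_commands` is a hypothetical helper, the output file names are made up for illustration, and no seed flag is passed because this lesson does not document one.

```python
import shlex


def alpha_sweep_commands(prompt, alphas=(0.2, 0.35, 0.5), frames=8, steps=25):
    """Build one generate.py command line per --temporal-alpha value."""
    commands = []
    for alpha in alphas:
        argv = [
            "python3", "examples/generate.py", "--mode", "blackhole",
            "--prompt", prompt,
            "--frames", str(frames), "--steps", str(steps),
            "--temporal-alpha", str(alpha),
            "--output", f"output/alpha_{alpha:.2f}.gif",  # illustrative name
        ]
        commands.append(shlex.join(argv))  # quotes the prompt safely
    return commands


sweep = alpha_sweep_commands("crackling campfire, cinematic 4K")
for line in sweep:
    print(line)
```

Paste the printed lines into a shell from the project directory, or hand the argument lists to subprocess.run instead of joining them.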

Title              Caption
Sacred geometry    sacred mandala blooming from starfield
Circuit as nature  circuit board growing like moss
Chip as cosmos     Blackhole chip glowing with embedded cosmos
Aurora             aurora borealis over arctic ice
Mayan temple       ancient Mayan temple under shifting cosmos
Nebula             swirling nebula in deep space

How Phase 2.5 works

animatediff_ttnn/ttnn_pipeline.py — the Blackhole denoising loop:

def generate_frames_temporal(device, ttnn_model, torch_vae, config,
                              torch_time_proj, text_embeddings,
                              num_frames, num_steps, seed, temporal_alpha):
    generator = torch.Generator().manual_seed(seed)
    base_noise = torch.randn((1, 4, 64, 64), generator=generator)

    # Per-frame seeded perturbation for variation
    frame_latents = [
        base_noise + 0.05 * torch.randn((1,4,64,64), generator=generator)
        for _ in range(num_frames)
    ]

    for step_idx, t in enumerate(timesteps):
        # TTNN UNet forward pass per frame on Blackhole
        noise_preds = []
        for i in range(num_frames):
            lat = to_device(frame_latents[i], device, ...)
            ttnn_out = ttnn_model(lat, timestep=_tlist[step_idx], ...)
            noise_preds.append(from_device(tt_guide(ttnn_out, guidance_scale), device))

        # Cross-frame attention on CPU — frames agree on structure
        noise_preds = cross_frame_attention(torch.stack(noise_preds), alpha=temporal_alpha)

        # Scheduler step
        for i in range(num_frames):
            frame_latents[i] = pndm_step(noise_preds[i], t, frame_latents[i])

    # VAE decode on CPU — TTNN VAE conv_out OOMs on Blackhole's L1 grid
    return [vae_decode(torch_vae, lat) for lat in frame_latents]

CLIP encoding uses the text encoder bundled inside SD 1.4 — no separate download. Tokens are padded 77 → 96 to match the TTNN UNet's expected sequence length.

MeshDevice: setup_blackhole() opens a MeshDevice(1,1) on a single chip. The SD 1.4 TTNN UNet uses ttnn.to_torch() without a mesh composer, which crashes on multi-chip tensors — single-chip until Phase 3 ships ShardTensorToMesh batched dispatch.
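Why do the frames stay coherent even before the attention pass? In generate_frames_temporal each frame latent is the shared base noise plus an independent perturbation scaled by 0.05, so any two frames correlate at 1 / (1 + 0.05²) ≈ 0.9975. The sketch below checks that figure empirically with the standard library only; the helper names are hypothetical, and the real pipeline draws its noise with torch.randn.

```python
import math
import random


def make_frame_latents(num_frames, size, perturb=0.05, seed=0):
    """Shared base noise plus a small independent perturbation per frame."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(size)]
    return [
        [b + perturb * rng.gauss(0.0, 1.0) for b in base]
        for _ in range(num_frames)
    ]


def pearson(xs, ys):
    """Sample Pearson correlation of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


latents = make_frame_latents(num_frames=2, size=4 * 64 * 64)
measured = pearson(latents[0], latents[1])
expected = 1.0 / (1.0 + 0.05 ** 2)  # Cov = Var(base) = 1, Var(frame) = 1 + 0.05^2
print(f"measured {measured:.4f}, expected {expected:.4f}")
```

Raising the perturbation scale lowers the correlation and gives the frames more room to diverge; the cross-frame attention pass then pulls them back together according to --temporal-alpha.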


The model bring-up methodology

This project demonstrates a complete workflow for integrating a new model:

Phase 1: Research (1–2 hours)

  1. Clone the reference implementation
  2. Read the architecture files — code reveals reality, papers tell the story
  3. Document key patterns: reshaping logic, injection points, weight keys
  4. Verify on CPU first — a working baseline is your ground truth

Phase 2: Design (30–60 min)

  1. Standalone package over monorepo modification — your code, your ownership
  2. Define the API surface: what does a caller need to pass in?
  3. Identify the TT-Metalium integration boundary: which ops stay PyTorch, which go TT-NN?

Phase 3: Implementation (2–4 hours)

  1. Start with PyTorch — easier to debug, matches reference
  2. Build the TT-NN path after PyTorch is validated
  3. Keep the CPU path alive as a regression check

Phase 4: Packaging (1 hour)

Phase 5: Validate on hardware (1–2 hours)

  1. First run: expect kernel compilation ~2–3 min — normal, cached after
  2. Compare output to Phase 1 CPU baseline at same prompt and seed
  3. If hardware hangs: tt-smi -r 0 1 2 3 && sleep 8
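The baseline comparison in step 2 can be made numeric with a simple per-frame metric. The sketch below is one possible approach, not the repo's validation code: `mean_abs_diff` and `compare_runs` are hypothetical helpers, and frames are assumed to arrive as flat sequences of 0–255 channel values, such as the bytes returned by PIL's Image.tobytes().

```python
def mean_abs_diff(frame_a, frame_b) -> float:
    """Mean absolute difference between two equally sized frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same size and mode")
    if not frame_a:
        raise ValueError("frames must not be empty")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def compare_runs(baseline_frames, candidate_frames):
    """Per-frame mean absolute difference between two runs."""
    if len(baseline_frames) != len(candidate_frames):
        raise ValueError("runs must have the same number of frames")
    return [mean_abs_diff(a, b) for a, b in zip(baseline_frames, candidate_frames)]


# Tiny synthetic example: two 2-pixel grayscale "frames" per run.
baseline = [bytes([0, 100]), bytes([50, 50])]
candidate = [bytes([0, 110]), bytes([50, 50])]
diffs = compare_runs(baseline, candidate)
print(diffs)
```

Phase 1 and Phase 2.5 use different temporal mechanisms, so the outputs should not be expected to match pixel for pixel. Treat the metric as a drift indicator across code changes rather than a pass/fail gate.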

Total for a complete new model: roughly 6–10 hours.


Known limitations

Issue                                                Status
TT-NN VAE OOMs on Blackhole conv_out                 Workaround: CPU PyTorch VAE decode
No TemporalTransformer blocks in TT-NN UNet          Phase 2.5 adds CPU cross-frame attention as a bridge
Single-chip only (multi-chip crashes on to_torch())  Use device_ids=[0] until Phase 3
TT-Metalium SD path changed in firmware 19.8.0       See docs/HARDWARE_COMPAT.md
First run 2–3 min kernel compilation                 Expected; cached after first run

No hardware? Use the simulator

ttsim is a bit-exact Blackhole simulator that runs on any Linux/x86_64 machine — same TTNN dispatch path as real hardware.

python3 examples/generate.py --mode sim \
    --sim ~/sim/libttsim_bh.so \
    --frames 2 --steps 4 --output output/sim_test.gif

See docs/SIMULATOR.md in the repo for full setup.


What's next

Add TemporalTransformer blocks to the TT-NN UNet

Full Phase 3 would inject TemporalTransformer blocks into the TT-NN UNet's BasicTransformerBlock instances — native TT-NN temporal attention, eliminating the CPU bounce in Phase 2.5.

Apply this pattern to other models

The standalone package pattern works for any model.


Resources