Hardware: N150, N300, T3K, P100, P300C · Time: 45 min · Status: Validated

Native Video Animation with AnimateDiff

Run SD 1.4 video generation on Blackhole — 15 seconds per frame, real images, no CPU fallback on the UNet.

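This lesson shows two paths to generating animated videos:

  1. Phase 1: the diffusers AnimateDiffPipeline on CPU, with true temporal attention across frames
  2. Phase 2: the TTNN UNet on Blackhole, denoising frames sequentially from shared base noise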

Along the way you'll learn the model bring-up methodology: how to create standalone packages that integrate with TT-Metal without modifying the core repository.

What you'll build

These frames were generated by Phase 2 running on a P300C (Blackhole):

Prompt 1: "1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, retro-futurist optimism, cinematic 4K"
[Image: world of tomorrow demo]

Prompt 2: "purple phosphor glow across distant mountains at 2am, retro CRT haze, cyan mist drifting through valleys, cinematic"
[Image: phosphor horizon demo]

8 frames × 25 denoising steps = 121 seconds on P300C.


What is AnimateDiff?

AnimateDiff adds temporal attention to Stable Diffusion 1.4 by injecting TemporalTransformer blocks into every BasicTransformerBlock in the UNet — at the 320-dim feature level where the motion weights were trained:

SD 1.4 UNet WITHOUT MotionAdapter:
  Noise → [Down blocks] → [Mid block] → [Up blocks] → Denoised latent
           each block: BasicTransformerBlock(spatial attention only)

SD 1.4 UNet WITH MotionAdapter (Phase 1):
  Noise → [Down blocks] → [Mid block] → [Up blocks] → Denoised latent
           each block: BasicTransformerBlock
                         └── spatial attention (unchanged, 320-dim)
                         └── TemporalTransformer (cross-frame, 320-dim)
                                            ↑
                               mm_sd_v15_v2.ckpt weights here

Why SD 1.4, not SD 3.5? The AnimateDiff motion weights (mm_sd_v15_v2.ckpt) were trained for SD 1.5's UNet with 320-dim transformer blocks. SD 1.4 uses the same UNet architecture, so the weights load unchanged. SD 3.5 uses a DiT with 2432-dim blocks, which is architecturally incompatible. The diffusers AnimateDiffPipeline handles MotionAdapter injection automatically when paired with the SD 1.4 base model.

How temporal attention works:

# Input: (batch*frames, spatial_tokens, channels)
# e.g.: (16, 4096, 320) for 16 frames of 64×64 latents

# Reshape to expose frame dimension
hidden = hidden.view(batch, frames, spatial, channels)
hidden = hidden.permute(0, 2, 1, 3)           # (b, spatial, frames, c)
hidden = hidden.reshape(batch*spatial, frames, channels)
# e.g.: (4096, 16, 320)
# Standard attention across the 16 frames → temporal coherence

Frames attend across each other at every denoising step — motion is baked in, not post-processed.
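To make the reshape concrete, here is a minimal NumPy sketch of attention across the frame axis. It is an illustration only, not the AnimateDiff implementation: the function name temporal_self_attention is hypothetical, and the learned Q/K/V projections, positional encodings, and multi-head split of the real TemporalTransformer are omitted (identity projections are assumed).

```python
import numpy as np

def temporal_self_attention(hidden, batch, frames):
    """Attend across frames independently at every spatial location.

    hidden: array of shape (batch * frames, spatial, channels).
    Returns an array of the same shape.
    """
    _, spatial, channels = hidden.shape

    # Expose the frame axis, then fold the spatial axis into the batch axis
    x = hidden.reshape(batch, frames, spatial, channels)
    x = x.transpose(0, 2, 1, 3).reshape(batch * spatial, frames, channels)

    # Scaled dot-product attention over frames (assumption: identity Q/K/V)
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(channels)  # (b*spatial, frames, frames)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ x                                      # (b*spatial, frames, channels)

    # Undo the reshape so spatial layers see (batch*frames, spatial, channels) again
    out = out.reshape(batch, spatial, frames, channels).transpose(0, 2, 1, 3)
    return out.reshape(batch * frames, spatial, channels)

# Small demo sizes: 4 frames, 16 spatial tokens, 8 channels
rng = np.random.default_rng(0)
hidden = rng.standard_normal((1 * 4, 16, 8))
out = temporal_self_attention(hidden, batch=1, frames=4)
print(out.shape)
```

Each spatial location becomes its own short sequence of length frames, so the attention cost grows with the square of the frame count rather than with the image size.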

Phase 2 tradeoff: The TTNN UNet does not currently have TemporalTransformer blocks, so frames are denoised sequentially with shared base noise for coherence. This gives ~15 s/frame vs ~2 min/frame for Phase 1. Adding full temporal attention to the TTNN UNet is future work.


Step 1: Deploy the project to your scratchpad

The AnimateDiff package is bundled with the extension. This copies it to ~/tt-scratchpad/tt-animatediff/ and installs it as an editable Python package:

📦 Setup AnimateDiff Project
mkdir -p ~/tt-scratchpad/tt-animatediff && cp -r "{{projectPath}}"/* ~/tt-scratchpad/tt-animatediff/ && cd ~/tt-scratchpad/tt-animatediff && pip install -e . && python3 -c "import animatediff_ttnn; print(\

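What this does: creates ~/tt-scratchpad/tt-animatediff/, copies the bundled project into it, installs it in editable mode with pip install -e ., and runs a test import of animatediff_ttnn to confirm the install.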

Project structure you'll have:

~/tt-scratchpad/tt-animatediff/
├── animatediff_ttnn/
│   ├── pipeline.py           # Phase 1: CPU AnimateDiffPipeline wrapper
│   └── ttnn_pipeline.py      # Phase 2: Blackhole TTNN UNet + PNDM scheduler
├── examples/
│   ├── generate_baseline.py  # Phase 1 (CPU, any hardware)
│   └── generate_blackhole.py # Phase 2 (Blackhole hardware)
├── output/                   # Generated GIFs land here
└── setup.py

Step 2: Download the models

# SD 1.4 — required for both phases
hf download CompVis/stable-diffusion-v1-4

# AnimateDiff motion adapter — Phase 1 only
hf download guoyww/animatediff-motion-adapter-v1-5-2

Step 3: Phase 1 — CPU AnimateDiffPipeline

The diffusers AnimateDiffPipeline loads SD 1.4, injects the MotionAdapter at every transformer block, and denoises ALL frames simultaneously with temporal attention. This gives true frame-to-frame coherence from the latent-space denoising.

🎬 Run Phase 1 (CPU)

Expected output (output/phase1.gif):

Loading pipeline...
Generating frames: 100%|██████████| 25/25
Saved 8 frames → output/phase1.gif

Performance: ~2 min/frame on CPU, ~12–21 s/frame on N150/N300.

What happens inside:

from diffusers import AnimateDiffPipeline, MotionAdapter

# Load the motion adapter and attach it to the SD 1.4 base model
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", motion_adapter=adapter)

prompt = "starry night sky over mountains, long exposure, 4K"  # any text prompt

# All 8 frames are denoised together, with temporal attention at every step
result = pipe(prompt=prompt, num_frames=8, num_inference_steps=25)
frames = result.frames[0]  # List of PIL Images

Or run it directly from your scratchpad with custom prompts:

cd ~/tt-scratchpad/tt-animatediff
python examples/generate_baseline.py \
    --prompt "purple phosphor glow across distant mountains at 2am, retro CRT haze, cinematic" \
    --frames 8 --steps 25 --output output/phosphor_cpu.gif
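
If you want to build a similar command-line entry point for your own package, the flags used above map naturally onto argparse. This is a hypothetical sketch: the lesson does not show how generate_baseline.py parses its arguments, and the defaults below are assumptions chosen to match the values used in this lesson.

```python
import argparse

def build_parser():
    """Parser mirroring the flags used in this lesson (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Generate an animated GIF from a text prompt")
    parser.add_argument("--prompt", required=True, help="Text prompt")
    parser.add_argument("--frames", type=int, default=8, help="Number of frames")
    parser.add_argument("--steps", type=int, default=25, help="Denoising steps")
    parser.add_argument("--output", default="output/phase1.gif", help="Destination GIF path")
    return parser

# Parse an explicit argument list so the sketch runs without a real command line
args = build_parser().parse_args([
    "--prompt", "starry night sky over mountains, long exposure, 4K",
    "--frames", "8", "--steps", "25",
    "--output", "output/phosphor_cpu.gif",
])
print(args.frames, args.steps, args.output)
```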

Step 4: Phase 2 — Blackhole TTNN UNet

Phase 2 replaces the PyTorch UNet with the TTNN UNet from ~/tt-metal/models/demos/wormhole/stable_diffusion/. The denoising loop runs on Blackhole, and latents are decoded with the CPU PyTorch VAE. The TTNN VAE runs out of memory on Blackhole's final conv_out because of an L1 grid mismatch in the Wormhole-targeted kernel (see Known limitations below).

Requires: Blackhole hardware (P100/P150/P300C/QB2) and ~/tt-metal built.

⚡ Run Phase 2 (Blackhole)

Expected output (output/blackhole.gif):

AnimateDiff Phase 2 — Blackhole TTNN UNet
  Prompt    : 1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, retro-futurist optimism, cinematic 4K
  Frames    : 8  Steps: 25

Opening Blackhole device...
Loading SD 1.4 models onto Blackhole...
  Models loaded in 7.3s
Encoding prompts with CLIP...
Generating 8 frames on Blackhole...
  Frame 1/8 done
  ...
  Frame 8/8 done
  Generated in 121.0s (15.1s/frame)
Saved 8 frames -> output/blackhole.gif

Performance on P300C: 7s model load + ~15s/frame. Kernel compilation ~2–3 min on first run (cached after).
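To budget a run, you can add up the one-time costs and the per-frame cost. The sketch below uses only the figures quoted in this lesson (7.3 s model load, ~15.1 s/frame, 2–3 min first-run compilation, ~2 min/frame for Phase 1 on CPU). The helper name estimate_seconds is hypothetical, and real timings will vary with hardware and prompt.

```python
def estimate_seconds(frames, sec_per_frame, model_load=0.0, first_run_compile=0.0):
    """Wall-clock estimate: one-time costs plus per-frame denoising time."""
    return model_load + first_run_compile + frames * sec_per_frame

# Phase 2 on P300C, using the numbers from the log above
generation = estimate_seconds(frames=8, sec_per_frame=15.1)
warm_run = estimate_seconds(8, 15.1, model_load=7.3)
cold_run_low = estimate_seconds(8, 15.1, model_load=7.3, first_run_compile=120)
cold_run_high = estimate_seconds(8, 15.1, model_load=7.3, first_run_compile=180)

# Phase 1 on CPU at roughly 2 minutes per frame
phase1_cpu = estimate_seconds(8, 120.0)

print(f"generation only : {generation:.1f} s")
print(f"warm run        : {warm_run:.1f} s")
print(f"first run       : {cold_run_low:.1f} to {cold_run_high:.1f} s")
print(f"Phase 1 on CPU  : {phase1_cpu:.1f} s")
print(f"speedup vs CPU  : {phase1_cpu / generation:.1f}x")
```

With these figures, generation alone comes to about 121 s for 8 frames, which matches the log above. On a first run, kernel compilation takes longer than the generation itself.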

Or run directly with custom prompts:

cd ~/tt-metal && source python_env/bin/activate
export TT_METAL_HOME=~/tt-metal TT_METAL_ARCH_NAME=blackhole
cd ~/tt-scratchpad/tt-animatediff
python examples/generate_blackhole.py \
    --prompt "your prompt here" --frames 8 --steps 25

Step 5: View your output

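📁 View Output Files
ls -lh ~/tt-scratchpad/tt-animatediff/output/ 2>/dev/null || echo "No output yet — run Phase 1 or Phase 2 first"

GIFs are in ~/tt-scratchpad/tt-animatediff/output/. Open them in any image viewer or browser.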


How Phase 2 works

animatediff_ttnn/ttnn_pipeline.py — the Blackhole pipeline:

def generate_frames(device, ttnn_model, torch_vae, config, ttnn_scheduler, ...):
    for frame_idx in range(num_frames):
        # Reset PNDM scheduler state (counter, ets buffer) before each frame
        ttnn_scheduler.set_timesteps(num_steps)

        # Shared base noise + small per-frame perturbation = inter-frame coherence
        frame_noise = base_noise + 0.05 * torch.randn_like(base_noise)
        ttnn_latents = ttnn.from_torch(frame_noise, ...)

        # Full PNDM denoising loop — runs on Blackhole TTNN
        for index in range(len(time_step)):
            latent_input = ttnn.concat([ttnn_latents, ttnn_latents], dim=0)  # CFG
            noise_pred = ttnn_model(latent_input, timestep=_tlist[index], ...)
            noise_pred = tt_guide(noise_pred, guidance_scale)
            ttnn_latents = ttnn_scheduler.step(noise_pred, t, ttnn_latents).prev_sample

        # Decode with PyTorch VAE on CPU (TTNN VAE OOMs on Blackhole conv_out)
        latents_cpu = ttnn.to_torch(ttnn_latents).float() / 0.18215
        decoded = torch_vae.decode(latents_cpu).sample
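
The loop above doubles the latent batch and then calls tt_guide, whose body is not shown in this lesson. The standard classifier-free guidance formula it presumably applies can be sketched in NumPy as follows. The name apply_guidance is hypothetical, and the ordering of the two halves (unconditional first, conditional second) is an assumption that follows the usual diffusers convention.

```python
import numpy as np

def apply_guidance(noise_pred, guidance_scale):
    """Classifier-free guidance over a doubled batch.

    Assumes the first half of the batch is the unconditional prediction
    and the second half is the text-conditioned prediction.
    """
    uncond, cond = np.split(noise_pred, 2, axis=0)
    # Push the prediction away from the unconditional result, toward the prompt
    return uncond + guidance_scale * (cond - uncond)

# Toy predictions shaped like SD 1.4 latents for one 512x512 frame
uncond = np.zeros((1, 4, 64, 64), dtype=np.float32)
cond = np.ones((1, 4, 64, 64), dtype=np.float32)
guided = apply_guidance(np.concatenate([uncond, cond], axis=0), guidance_scale=7.5)
print(guided.shape, float(guided[0, 0, 0, 0]))
```

A guidance_scale of 1.0 returns the conditional prediction unchanged. Larger values follow the prompt more strongly at the cost of variety.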

CLIP encoding uses the text encoder bundled inside SD 1.4 — no separate model download:

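import torch
from transformers import CLIPTextModel, CLIPTokenizer
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")
tokens = tokenizer(prompt, padding="max_length", max_length=77, truncation=True, return_tensors="pt")  # assumed step; prompt is your text
embeds = text_encoder(tokens.input_ids)[0]  # (1, 77, 768)
# Pad 77 → 96 tokens along the sequence axis: the TTNN UNet expects 96-token sequences
embeds = torch.nn.functional.pad(embeds, (0, 0, 0, 19))  # (1, 96, 768)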
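
The padding arithmetic is easy to get wrong, because torch.nn.functional.pad lists padding from the last axis backwards. Here is the same operation in NumPy, where each axis is given explicitly. The helper name pad_tokens is hypothetical, and the all-ones input stands in for real CLIP embeddings.

```python
import numpy as np

def pad_tokens(embeds, target_len=96):
    """Zero-pad the token axis of a (batch, tokens, dim) array up to target_len."""
    extra = target_len - embeds.shape[1]
    if extra < 0:
        raise ValueError(f"sequence has {embeds.shape[1]} tokens, more than {target_len}")
    # Pad only at the end of axis 1 (tokens); batch and feature axes are untouched
    return np.pad(embeds, ((0, 0), (0, extra), (0, 0)))

clip_embeds = np.ones((1, 77, 768), dtype=np.float32)  # stand-in for CLIP output
padded = pad_tokens(clip_embeds)
print(padded.shape)
```

The 19 appended token positions are all zeros, and the original 77 are left as they were.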

Prompt tips

SD 1.4 responds well to photography-style prompts:

Goal | Prompt
Retro-futurist city | "1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, cinematic 4K"
Phosphor landscape | "purple phosphor glow across distant mountains at 2am, retro CRT haze, cyan mist, cinematic"
Night sky | "starry night sky over mountains, long exposure, 4K"
Abstract | "colorful aurora borealis, northern lights, long exposure"

Tuning coherence: The 0.05 noise perturbation in ttnn_pipeline.py controls frame variation. Edit ~/tt-scratchpad/tt-animatediff/animatediff_ttnn/ttnn_pipeline.py to adjust — higher values give more frame-to-frame motion.
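To see why a small perturbation keeps frames coherent, look at how correlated the starting noise of two frames is. If each frame starts from base + s·ε, with base and ε independent unit-variance Gaussians, the correlation between two frames is 1 / (1 + s²). The sketch below checks that formula against a sample. It uses only the standard library; the function names are hypothetical and it is not part of ttnn_pipeline.py.

```python
import random

def frame_noise_correlation(scale):
    """Expected correlation between two frames' starting noise.

    Both frames share `base`; each adds an independent `scale * eps`.
    Covariance is 1 and each variance is 1 + scale**2.
    """
    return 1.0 / (1.0 + scale ** 2)

def empirical_correlation(scale, n=20000, seed=0):
    """Estimate the same correlation from n sampled noise values."""
    rng = random.Random(seed)
    base = [rng.gauss(0, 1) for _ in range(n)]
    a = [b + scale * rng.gauss(0, 1) for b in base]
    c = [b + scale * rng.gauss(0, 1) for b in base]
    mean_a, mean_c = sum(a) / n, sum(c) / n
    cov = sum((x - mean_a) * (y - mean_c) for x, y in zip(a, c)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_c = sum((y - mean_c) ** 2 for y in c) / n
    return cov / (var_a * var_c) ** 0.5

for scale in (0.05, 0.2, 1.0):
    print(scale, round(frame_noise_correlation(scale), 4),
          round(empirical_correlation(scale), 4))
```

At the default 0.05 the starting noise of any two frames is about 99.75% correlated, so frames differ only slightly. At 1.0 the correlation drops to 0.5 and frames will look much less alike.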


The model bring-up methodology

This project demonstrates the complete workflow for integrating a new model:

Phase 1: Research (1–2 hours)

  1. Clone the reference implementation (guoyww/AnimateDiff)
  2. Download model weights to understand their structure
  3. Read the core architecture files (not just papers — code reveals reality)
  4. Document key patterns: reshaping logic, injection points, weight keys

Phase 2: Design (30–60 min)

  1. Choose standalone vs integrated approach — standalone wins for maintainability
  2. Create project structure outside ~/tt-metal/ (your code, your ownership)
  3. Define the API surface
  4. Identify the TT-Metal integration boundary

Phase 3: Implementation (2–4 hours)

  1. Start with PyTorch — easier to debug, matches reference
  2. Implement the core module first
  3. Build the high-level wrapper second
  4. Add the TTNN path last, after PyTorch is validated

Phase 4: Packaging (1 hour)

Phase 5: Validate on hardware (1–2 hours)

  1. First run: expect kernel compilation ~2–3 min — that's normal
  2. Verify output shapes match expected latent dimensions
  3. Check VAE decode produces recognizable images
  4. Benchmark: frames/second, memory usage
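
The shape and benchmark checks in the list above can be sketched as small helpers. The function names are hypothetical. The defaults of 4 latent channels and 8× VAE downsampling are the standard SD 1.x values, consistent with the 64×64 latents mentioned earlier in this lesson.

```python
def expected_latent_shape(height, width, batch=1, latent_channels=4, vae_scale=8):
    """Latent shape for an image size, given the VAE's downsampling factor."""
    if height % vae_scale or width % vae_scale:
        raise ValueError(f"image size must be a multiple of {vae_scale}")
    return (batch, latent_channels, height // vae_scale, width // vae_scale)

def check_latents(shape, height=512, width=512):
    """Raise if a latent tensor's shape does not match the image size."""
    expected = expected_latent_shape(height, width)
    if tuple(shape) != expected:
        raise ValueError(f"latent shape {tuple(shape)} != expected {expected}")
    return True

def frames_per_second(frames, seconds):
    """Throughput figure for the benchmark step."""
    return frames / seconds

print(expected_latent_shape(512, 512))
print(check_latents((1, 4, 64, 64)))
print(round(frames_per_second(8, 121.0), 3))
```

Running check_latents on the tensor's shape just before the VAE decode catches layout mistakes early, before they show up as noise in the output images.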

Total for a complete new model: roughly 5.5–10 hours, summing the phase estimates above. This is the path from demos to real applications.


Known limitations

Issue | Status
TTNN VAE OOMs on Blackhole conv_out | Workaround: CPU PyTorch VAE decode
No TemporalTransformer in TTNN UNet | Phase 1 only; Phase 2 uses shared-noise coherence
DispatchCoreAxis.ROW crashes on Blackhole | Avoided: setup_blackhole() uses auto-detect
First run 2–3 min kernel compilation | Expected; cached after first run

What's next

Add temporal attention to the TTNN UNet

Full Phase 2 would inject TemporalTransformer blocks into the TTNN UNet's BasicTransformerBlock instances — the same injection pattern as Phase 1's MotionAdapter, but in TTNN ops. This would bring true AnimateDiff temporal coherence to Blackhole-accelerated generation.

Apply this pattern to other models

The standalone package pattern works for any model: