Hardware: N150, N300, T3K, P100, P300C | Duration: 45 min | Status: Validated

Native Video Animation with AnimateDiff

Run SD 1.4 video generation on Blackhole — 15 seconds per frame, real images, no CPU fallback on the UNet.

This lesson shows two paths to generating animated videos:

Phase 1 (any hardware) — diffusers AnimateDiffPipeline on CPU, full temporal attention via MotionAdapter, ~2 min/frame
Phase 2 (Blackhole) — TTNN UNet on Blackhole, sequential denoising, ~15 seconds/frame

Along the way you'll learn the model bring-up methodology: how to create standalone packages that integrate with TT-Metal without modifying the core repository.

What you'll build

These frames were generated by Phase 2 running on a P300C (Blackhole):

Prompt: "1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, retro-futurist optimism, cinematic 4K"
Prompt: "purple phosphor glow across distant mountains at 2am, retro CRT haze, cyan mist drifting through valleys, cinematic"

8 frames at 25 denoising steps each took 121 seconds on a P300C (about 15 s/frame).

What is AnimateDiff?

AnimateDiff adds temporal attention to Stable Diffusion 1.4 by injecting TemporalTransformer blocks into every BasicTransformerBlock in the UNet — at the 320-dim feature level where the motion weights were trained:

SD 1.4 UNet WITHOUT MotionAdapter:
  Noise → [Down blocks] → [Mid block] → [Up blocks] → Denoised latent
           each block: BasicTransformerBlock(spatial attention only)

SD 1.4 UNet WITH MotionAdapter (Phase 1):
  Noise → [Down blocks] → [Mid block] → [Up blocks] → Denoised latent
           each block: BasicTransformerBlock
                         └── spatial attention (unchanged, 320-dim)
                         └── TemporalTransformer (cross-frame, 320-dim)
                                            ↑
                               mm_sd_v15_v2.ckpt weights here

Why SD 1.4, not SD 3.5? The AnimateDiff motion weights (mm_sd_v15_v2.ckpt) were trained for SD 1.5's UNet with 320-dim transformer blocks. SD 1.4 uses the same UNet architecture as SD 1.5, so the weights apply to it unchanged. SD 3.5 uses a DiT with 2432-dim blocks, which is architecturally incompatible. The diffusers AnimateDiffPipeline handles MotionAdapter injection automatically when paired with the SD 1.4 base model.

How temporal attention works:

import torch

batch, frames, spatial, channels = 1, 16, 64 * 64, 320

# Input: (batch*frames, spatial_tokens, channels)
# e.g.: (16, 4096, 320) for 16 frames of 64×64 latents
hidden = torch.randn(batch * frames, spatial, channels)

# Reshape to expose frame dimension
hidden = hidden.view(batch, frames, spatial, channels)
hidden = hidden.permute(0, 2, 1, 3)           # (b, spatial, frames, c)
hidden = hidden.reshape(batch * spatial, frames, channels)
# e.g.: (4096, 16, 320)
# Standard attention across the 16 frames → temporal coherence

Frames attend across each other at every denoising step — motion is baked in, not post-processed.
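The regrouping can be checked without PyTorch. Below is a minimal pure-Python sketch that uses nested lists in place of tensors; `to_temporal_layout` is a hypothetical helper written for this illustration and is not part of diffusers or the AnimateDiff package. Each token is tagged with its frame and position so the new ordering is easy to read.

```python
def to_temporal_layout(hidden, batch, frames):
    """Regroup (batch*frames, spatial, channels) into (batch*spatial, frames, channels)."""
    spatial = len(hidden[0])
    out = []
    for b in range(batch):
        for s in range(spatial):
            # One sequence per spatial location, ordered by frame index
            out.append([hidden[b * frames + f][s] for f in range(frames)])
    return out

# Toy input: 1 batch, 3 frames, 2 spatial tokens, 1 channel.
# Token value = frame * 10 + position, so 21 means "frame 2, position 1".
frames, spatial = 3, 2
hidden = [[[f * 10 + s] for s in range(spatial)] for f in range(frames)]

temporal = to_temporal_layout(hidden, batch=1, frames=frames)
print(temporal)  # [[[0], [10], [20]], [[1], [11], [21]]]
```

After the regrouping, each row holds the same spatial position across all frames, which is the sequence that temporal attention operates on.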

Phase 2 tradeoff: The TTNN UNet does not currently have TemporalTransformer blocks, so frames are denoised sequentially with shared base noise for coherence. This gives ~15 s/frame vs ~2 min/frame for Phase 1. Adding full temporal attention to the TTNN UNet is future work.
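To put the tradeoff in concrete terms, here is a small worked calculation using the approximate per-frame figures quoted in this lesson. The constants are rounded and the helper name `clip_seconds` is made up for this sketch; model load and first-run kernel compilation are not included.

```python
PHASE1_S_PER_FRAME = 120.0  # "~2 min/frame" on CPU
PHASE2_S_PER_FRAME = 15.0   # "~15 s/frame" on Blackhole

def clip_seconds(num_frames, s_per_frame):
    """Total generation time for a clip, ignoring load and compile time."""
    return num_frames * s_per_frame

for n in (8, 16):
    p1 = clip_seconds(n, PHASE1_S_PER_FRAME)
    p2 = clip_seconds(n, PHASE2_S_PER_FRAME)
    print(f"{n} frames: Phase 1 ~{p1 / 60:.0f} min, Phase 2 ~{p2 / 60:.0f} min ({p1 / p2:.0f}x)")
```

For an 8-frame clip this works out to roughly 16 minutes for Phase 1 versus 2 minutes for Phase 2, at the cost of losing true cross-frame attention.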

Step 1: Deploy the project to your scratchpad

The AnimateDiff package is bundled with the extension. This copies it to ~/tt-scratchpad/tt-animatediff/ and installs it as an editable Python package:

📦 Setup AnimateDiff Project

VS Code

mkdir -p ~/tt-scratchpad/tt-animatediff && cp -r "{{projectPath}}"/* ~/tt-scratchpad/tt-animatediff/ && cd ~/tt-scratchpad/tt-animatediff && pip install -e . && python3 -c "import animatediff_ttnn; print(\

What this does:

Copies animatediff_ttnn/, examples/, setup.py to ~/tt-scratchpad/tt-animatediff/
Runs pip install -e . — your editable copy, ready to modify
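If you want to confirm the editable install from Python rather than the shell, a minimal check might look like the following. `is_installed` is a hypothetical helper for this sketch; it only asks whether the interpreter can locate the package and does not import it.

```python
import importlib.util

def is_installed(package_name):
    """Return True if the current interpreter can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None

print("animatediff_ttnn importable:", is_installed("animatediff_ttnn"))
```

Run it with the same Python environment you used for `pip install -e .`, since an editable install is only visible to that environment.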

Project structure you'll have:

~/tt-scratchpad/tt-animatediff/
├── animatediff_ttnn/
│   ├── pipeline.py           # Phase 1: CPU AnimateDiffPipeline wrapper
│   └── ttnn_pipeline.py      # Phase 2: Blackhole TTNN UNet + PNDM scheduler
├── examples/
│   ├── generate_baseline.py  # Phase 1 (CPU, any hardware)
│   └── generate_blackhole.py # Phase 2 (Blackhole hardware)
├── output/                   # Generated GIFs land here
└── setup.py

Step 2: Download the models

# SD 1.4 — required for both phases
hf download CompVis/stable-diffusion-v1-4

# AnimateDiff motion adapter — Phase 1 only
hf download guoyww/animatediff-motion-adapter-v1-5-2
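To see whether the downloads landed, you can look for the repositories in the local Hugging Face cache. The sketch below assumes the default cache layout (`~/.cache/huggingface/hub/models--<org>--<name>`, overridable through `HF_HOME` or `HF_HUB_CACHE`); `hf_cache_dir` is a hypothetical helper, and a custom cache location on your machine would need a different path.

```python
import os
from pathlib import Path

def hf_cache_dir(repo_id, cache_root=None):
    """Return the directory where the Hugging Face cache is assumed to keep repo_id."""
    if cache_root is None:
        hf_home = os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface"))
        cache_root = os.environ.get("HF_HUB_CACHE", str(Path(hf_home) / "hub"))
    return Path(cache_root) / ("models--" + repo_id.replace("/", "--"))

for repo in ("CompVis/stable-diffusion-v1-4", "guoyww/animatediff-motion-adapter-v1-5-2"):
    path = hf_cache_dir(repo)
    status = "found" if path.is_dir() else "missing"
    print(f"{repo}: {status} ({path})")
```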

Step 3: Phase 1 — CPU AnimateDiffPipeline

The diffusers AnimateDiffPipeline loads SD 1.4, injects the MotionAdapter at every transformer block, and denoises ALL frames simultaneously with temporal attention. This gives true frame-to-frame coherence from the latent-space denoising.

🎬 Run Phase 1 (CPU)

VS Code

# Lists the output directory, or prints a hint if nothing has been generated yet
ls -lh ~/tt-scratchpad/tt-animatediff/output/ 2>/dev/null || echo "No output yet — run Phase 1 or Phase 2 first"

Expected output (output/phase1.gif):

Loading pipeline...
Generating frames: 100%|██████████| 25/25
Saved 8 frames → output/phase1.gif

Performance: ~2 min/frame on CPU, ~12–21 s/frame on N150/N300.

What happens inside:

from diffusers import AnimateDiffPipeline, MotionAdapter

prompt = "purple phosphor glow across distant mountains at 2am, retro CRT haze, cinematic"

# Load the motion weights, then attach them to the SD 1.4 base model
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", motion_adapter=adapter)

# All 8 frames denoised together, with temporal attention at every step
result = pipe(prompt=prompt, num_frames=8, num_inference_steps=25)
frames = result.frames[0]  # List of PIL Images

Or run it directly from your scratchpad with custom prompts:

cd ~/tt-scratchpad/tt-animatediff
python examples/generate_baseline.py \
    --prompt "purple phosphor glow across distant mountains at 2am, retro CRT haze, cinematic" \
    --frames 8 --steps 25 --output output/phosphor_cpu.gif
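The flags in the command above suggest a small argparse interface. The following is a hypothetical sketch of how a script like this might parse them; the bundled generate_baseline.py may be organized differently, and the defaults shown are assumptions taken from the example command and expected output.

```python
import argparse

def build_parser():
    """Build a parser for the four flags used in the example command."""
    parser = argparse.ArgumentParser(description="Generate an animated GIF (illustrative sketch)")
    parser.add_argument("--prompt", required=True, help="Text prompt for the animation")
    parser.add_argument("--frames", type=int, default=8, help="Number of frames to generate")
    parser.add_argument("--steps", type=int, default=25, help="Number of denoising steps")
    parser.add_argument("--output", default="output/phase1.gif", help="Path of the GIF to write")
    return parser

# Parse an explicit argument list so the sketch runs without a command line
args = build_parser().parse_args(["--prompt", "starry night sky over mountains", "--frames", "8"])
print(args.prompt, args.frames, args.steps, args.output)
```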

Step 4: Phase 2 — Blackhole TTNN UNet

Phase 2 replaces the PyTorch UNet with the TTNN UNet from ~/tt-metal/models/demos/wormhole/stable_diffusion/. The denoising loop runs on Blackhole, and latents are decoded with the PyTorch VAE on CPU. The TTNN VAE runs out of memory on Blackhole in its final conv_out because of an L1 grid mismatch in the Wormhole-targeted kernel (see Known limitations below).

Requires: Blackhole hardware (P100/P150/P300C/QB2) and ~/tt-metal built.

⚡ Run Phase 2 (Blackhole)

Expected output (output/blackhole.gif):

AnimateDiff Phase 2 — Blackhole TTNN UNet
  Prompt    : 1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, retro-futurist optimism, cinematic 4K
  Frames    : 8  Steps: 25

Opening Blackhole device...
Loading SD 1.4 models onto Blackhole...
  Models loaded in 7.3s
Encoding prompts with CLIP...
Generating 8 frames on Blackhole...
  Frame 1/8 done
  ...
  Frame 8/8 done
  Generated in 121.0s (15.1s/frame)
Saved 8 frames -> output/blackhole.gif

Performance on P300C: about 7 s model load plus ~15 s/frame. The first run also spends ~2–3 min on kernel compilation, which is cached afterwards.

Or run directly with custom prompts:

cd ~/tt-metal && source python_env/bin/activate
export TT_METAL_HOME=~/tt-metal TT_METAL_ARCH_NAME=blackhole
cd ~/tt-scratchpad/tt-animatediff
python examples/generate_blackhole.py \
    --prompt "your prompt here" --frames 8 --steps 25
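Most Phase 2 startup failures come from a missing environment variable, so it can help to check the two variables exported above before launching. This is a minimal sketch under that assumption; `check_phase2_env` is a hypothetical helper and is not part of the package or of TT-Metal.

```python
import os
from pathlib import Path

def check_phase2_env(env):
    """Return a list of problems found; an empty list means the environment looks ready."""
    problems = []
    home = env.get("TT_METAL_HOME")
    if not home:
        problems.append("TT_METAL_HOME is not set")
    elif not Path(home).expanduser().is_dir():
        problems.append(f"TT_METAL_HOME does not exist: {home}")
    if env.get("TT_METAL_ARCH_NAME") != "blackhole":
        problems.append("TT_METAL_ARCH_NAME should be 'blackhole'")
    return problems

for problem in check_phase2_env(os.environ):
    print("WARNING:", problem)
```

The function takes any mapping, so you can also pass a plain dict to try it out without touching your shell environment.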

Step 5: View your output

📁 View Output Files

GIFs are in ~/tt-scratchpad/tt-animatediff/output/. Open them in any image viewer or browser.

How Phase 2 works

animatediff_ttnn/ttnn_pipeline.py — the Blackhole pipeline:

def generate_frames(device, ttnn_model, torch_vae, config, ttnn_scheduler, ...):
    for frame_idx in range(num_frames):
        # Reset PNDM scheduler state (counter, ets buffer) before each frame
        ttnn_scheduler.set_timesteps(num_steps)

        # Shared base noise + small per-frame perturbation = inter-frame coherence
        frame_noise = base_noise + 0.05 * torch.randn_like(base_noise)
        ttnn_latents = ttnn.from_torch(frame_noise, ...)

        # Full PNDM denoising loop — runs on Blackhole TTNN
        for index in range(len(time_step)):
            latent_input = ttnn.concat([ttnn_latents, ttnn_latents], dim=0)  # CFG
            noise_pred = ttnn_model(latent_input, timestep=_tlist[index], ...)
            noise_pred = tt_guide(noise_pred, guidance_scale)
            ttnn_latents = ttnn_scheduler.step(noise_pred, t, ttnn_latents).prev_sample

        # Decode with PyTorch VAE on CPU (TTNN VAE OOMs on Blackhole conv_out)
        latents_cpu = ttnn.to_torch(ttnn_latents).float() / 0.18215
        decoded = torch_vae.decode(latents_cpu).sample
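The `tt_guide` call is where classifier-free guidance (CFG) is applied. The excerpt does not show its body, so the sketch below assumes it follows the standard CFG formula and that the doubled batch is ordered unconditional first, text-conditioned second (the usual diffusers convention). It uses plain Python lists in place of TTNN tensors, and `guide` is a hypothetical stand-in name.

```python
def guide(noise_uncond, noise_text, guidance_scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the text-conditioned one by guidance_scale."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]

noise_uncond = [0.0, 1.0]
noise_text = [1.0, 1.0]

print(guide(noise_uncond, noise_text, 7.5))  # [7.5, 1.0]
print(guide(noise_uncond, noise_text, 1.0))  # [1.0, 1.0]
```

A scale of 1.0 returns the text-conditioned prediction unchanged, and larger values exaggerate whatever the prompt changes relative to the unconditional prediction.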

CLIP encoding uses the text encoder bundled inside SD 1.4 — no separate model download:

tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")
# Pad 77 → 96 tokens: TTNN UNet expects 96-token sequences
embeds = torch.nn.functional.pad(embeds, (0, 0, 0, 19))  # (1, 96, 768)
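In `torch.nn.functional.pad`, the pad tuple is read from the last dimension backwards, so `(0, 0, 0, 19)` leaves the 768-wide embedding axis alone and appends 19 zero rows to the token axis (77 + 19 = 96). The same operation on nested lists looks like this; `pad_tokens` is a hypothetical helper for illustration only.

```python
def pad_tokens(embeds, target_len=96):
    """Zero-pad a (seq_len, width) nested list along the sequence axis."""
    seq_len = len(embeds)
    if seq_len > target_len:
        raise ValueError(f"sequence of {seq_len} tokens exceeds target length {target_len}")
    width = len(embeds[0])
    padding = [[0.0] * width for _ in range(target_len - seq_len)]
    return embeds + padding

clip_out = [[1.0] * 768 for _ in range(77)]  # stand-in for one prompt's CLIP output
padded = pad_tokens(clip_out)
print(len(padded), len(padded[0]), 96 - 77)  # 96 768 19
```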

Prompt tips

SD 1.4 responds well to photography-style prompts:

Goal	Prompt
Retro-futurist city	`"1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, cinematic 4K"`
Phosphor landscape	`"purple phosphor glow across distant mountains at 2am, retro CRT haze, cyan mist, cinematic"`
Night sky	`"starry night sky over mountains, long exposure, 4K"`
Abstract	`"colorful aurora borealis, northern lights, long exposure"`

Tuning coherence: The 0.05 noise perturbation in ttnn_pipeline.py controls frame variation. Edit ~/tt-scratchpad/tt-animatediff/animatediff_ttnn/ttnn_pipeline.py to adjust — higher values give more frame-to-frame motion.
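To get a feel for the perturbation scale before editing the pipeline, you can measure how similar two frames' starting noise is for different values. The sketch below is a pure-Python approximation on a flat list of 4096 samples rather than a real latent tensor, and all helper names are hypothetical. In expectation the correlation between two frames' noise is 1 / (1 + scale²).

```python
import random

def make_frame_noise(base, scale, rng):
    """Shared base noise plus an independent per-frame perturbation."""
    return [b + scale * rng.gauss(0.0, 1.0) for b in base]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def frame_similarity(scale, n=4096, seed=0):
    """Correlation between the starting noise of two frames at a given scale."""
    rng = random.Random(seed)
    base = [rng.gauss(0.0, 1.0) for _ in range(n)]
    frame_a = make_frame_noise(base, scale, rng)
    frame_b = make_frame_noise(base, scale, rng)
    return pearson(frame_a, frame_b)

for scale in (0.05, 0.2, 0.5):
    print(f"scale={scale}: correlation between frames = {frame_similarity(scale):.3f}")
```

At 0.05 the two frames start from almost identical noise, which is why they stay coherent. Larger scales decorrelate the frames and also push the noise variance above 1 (it becomes 1 + scale²), so very large values may degrade image quality as well as coherence.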

The model bring-up methodology

This project demonstrates a complete workflow for integrating a new model:

Phase 1: Research (1–2 hours)

Clone the reference implementation (guoyww/AnimateDiff)
Download model weights to understand their structure
Read the core architecture files (not just papers — code reveals reality)
Document key patterns: reshaping logic, injection points, weight keys

Phase 2: Design (30–60 min)

Choose standalone vs integrated approach — standalone wins for maintainability
Create project structure outside ~/tt-metal/ (your code, your ownership)
Define the API surface
Identify the TT-Metal integration boundary

Phase 3: Implementation (2–4 hours)

Start with PyTorch — easier to debug, matches reference
Implement the core module first
Build the high-level wrapper second
Add the TTNN path last, after PyTorch is validated

Phase 4: Packaging (1 hour)

setup.py + requirements.txt make the project installable with pip install -e .
Example scripts show exactly how to run it
README documents what works and what doesn't

Phase 5: Validate on hardware (1–2 hours)

First run: expect kernel compilation ~2–3 min — that's normal
Verify output shapes match expected latent dimensions
Check VAE decode produces recognizable images
Benchmark: frames/second, memory usage
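For the benchmark step, a small timing wrapper is usually enough to get a seconds-per-frame figure comparable to the numbers in this lesson. This is a minimal sketch: `seconds_per_frame` and `fake_generate` are hypothetical names, and `fake_generate` stands in for whatever function actually produces your frames.

```python
import time

def seconds_per_frame(generate, num_frames):
    """Time one call to generate(num_frames) and return the average seconds per frame."""
    start = time.perf_counter()
    frames = generate(num_frames)
    elapsed = time.perf_counter() - start
    if len(frames) != num_frames:
        raise RuntimeError(f"expected {num_frames} frames, got {len(frames)}")
    return elapsed / num_frames

def fake_generate(num_frames):
    """Placeholder workload: sleeps briefly per frame instead of running a model."""
    time.sleep(0.01 * num_frames)
    return [None] * num_frames

print(f"{seconds_per_frame(fake_generate, 4):.3f} s/frame")
```

On hardware, time a second run as well: the first run includes kernel compilation, so it overstates the steady-state cost.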

Total for a complete new model: roughly 6–10 hours (the phase estimates above add up to 5.5–10). This is the path from demos to real applications.

Known limitations

Issue	Status
TTNN VAE OOMs on Blackhole `conv_out`	Workaround: CPU PyTorch VAE decode
No TemporalTransformer in TTNN UNet	Temporal attention is available in Phase 1 only; Phase 2 uses shared-noise coherence
`DispatchCoreAxis.ROW` crashes on Blackhole	Avoided: `setup_blackhole()` uses auto-detect
First run 2–3 min kernel compilation	Expected; cached after first run

What's next

Add temporal attention to the TTNN UNet

A full Phase 2 would inject TemporalTransformer blocks into the TTNN UNet's BasicTransformerBlock instances, using the same injection pattern as Phase 1's MotionAdapter but implemented in TTNN ops. This would bring true AnimateDiff temporal coherence to Blackhole-accelerated generation.

Apply this pattern to other models

The standalone package pattern works for any model:

ControlNet — conditioning inputs for SD 1.4
LoRA — weight delta injection into the SD UNet
IP-Adapter — image-conditioned generation
Any PyTorch model — wrap it, validate on CPU, port to TTNN