Native Video Animation with AnimateDiff
Run SD 1.4 video generation on Blackhole — 15 seconds per frame, real images, no CPU fallback on the UNet.
This lesson shows two paths to generating animated videos:
- Phase 1 (any hardware): diffusers AnimateDiffPipeline on CPU, full temporal attention via MotionAdapter, ~2 min/frame
- Phase 2 (Blackhole): TTNN UNet on Blackhole, sequential denoising, ~15 seconds/frame
Along the way you'll learn the model bring-up methodology: how to create standalone packages that integrate with TT-Metal without modifying the core repository.
What you'll build
These frames were generated by Phase 2 running on a P300C (Blackhole):
| Prompt: "1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, retro-futurist optimism, cinematic 4K" | Prompt: "purple phosphor glow across distant mountains at 2am, retro CRT haze, cyan mist drifting through valleys, cinematic" |
|---|---|
| (generated frame, image not shown) | (generated frame, image not shown) |
8 frames × 25 denoising steps each took 121 seconds in total on P300C (about 15 s/frame).
What is AnimateDiff?
AnimateDiff adds temporal attention to Stable Diffusion 1.4 by injecting TemporalTransformer blocks into every BasicTransformerBlock in the UNet — at the 320-dim feature level where the motion weights were trained:
SD 1.4 UNet WITHOUT MotionAdapter:
Noise → [Down blocks] → [Mid block] → [Up blocks] → Denoised latent
each block: BasicTransformerBlock(spatial attention only)
SD 1.4 UNet WITH MotionAdapter (Phase 1):
Noise → [Down blocks] → [Mid block] → [Up blocks] → Denoised latent
each block: BasicTransformerBlock
└── spatial attention (unchanged, 320-dim)
└── TemporalTransformer (cross-frame, 320-dim)
↑
mm_sd_v15_v2.ckpt weights here
Why SD 1.4, not SD 3.5? AnimateDiff motion weights (mm_sd_v15_v2.ckpt) were trained for SD 1.5's UNet with 320-dim transformer blocks. SD 3.5 uses a DiT with 2432-dim blocks — architecturally incompatible. The diffusers AnimateDiffPipeline handles MotionAdapter injection automatically when paired with the SD 1.4 base model.
How temporal attention works:
# Input: (batch*frames, spatial_tokens, channels)
# e.g.: (16, 4096, 320) for 16 frames of 64×64 latents
# Reshape to expose frame dimension
hidden = hidden.view(batch, frames, spatial, channels)
hidden = hidden.permute(0, 2, 1, 3) # (b, spatial, frames, c)
hidden = hidden.reshape(batch*spatial, frames, channels)
# e.g.: (4096, 16, 320)
# Standard attention across the 16 frames → temporal coherence
Frames attend across each other at every denoising step — motion is baked in, not post-processed.
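As a dependency-free illustration of the regrouping above, the sketch below performs the same index shuffle on nested Python lists. The helper name `to_temporal_layout` and the toy sizes are assumptions for illustration only; the real pipeline does this with tensor `view`/`permute`/`reshape` as shown.

```python
def to_temporal_layout(hidden, batch, frames):
    """Regroup (batch*frames, spatial, channels) into (batch*spatial, frames, channels).

    `hidden` is a nested list; this hypothetical helper mirrors view/permute/reshape.
    """
    if len(hidden) != batch * frames:
        raise ValueError("first dimension must equal batch * frames")
    spatial = len(hidden[0])
    out = []
    for b in range(batch):
        for s in range(spatial):
            # One sequence per spatial location: its channel vector in every frame
            out.append([hidden[b * frames + f][s] for f in range(frames)])
    return out


# Toy input: 1 batch, 3 frames, 2 spatial tokens, 1 channel; value encodes (frame, token)
toy = [[[10 * f + s] for s in range(2)] for f in range(3)]
temporal = to_temporal_layout(toy, batch=1, frames=3)
print(len(temporal), len(temporal[0]))  # number of sequences, frames per sequence
print(temporal[1])  # spatial token 1 followed across all frames
```

Each output row is one spatial location seen across all frames, which is the sequence that temporal attention operates on.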
Phase 2 tradeoff: The TTNN UNet does not currently have TemporalTransformer blocks, so frames are denoised sequentially with shared base noise for coherence. This gives ~15 s/frame vs ~2 min/frame for Phase 1. Adding full temporal attention to the TTNN UNet is future work.
Step 1: Deploy the project to your scratchpad
The AnimateDiff package is bundled with the extension. This copies it to ~/tt-scratchpad/tt-animatediff/ and installs it as an editable Python package:
mkdir -p ~/tt-scratchpad/tt-animatediff && cp -r "{{projectPath}}"/* ~/tt-scratchpad/tt-animatediff/ && cd ~/tt-scratchpad/tt-animatediff && pip install -e . && python3 -c "import animatediff_ttnn"
What this does:
- Copies animatediff_ttnn/, examples/, and setup.py to ~/tt-scratchpad/tt-animatediff/
- Runs pip install -e . so you get an editable copy, ready to modify
Project structure you'll have:
~/tt-scratchpad/tt-animatediff/
├── animatediff_ttnn/
│ ├── pipeline.py # Phase 1: CPU AnimateDiffPipeline wrapper
│ └── ttnn_pipeline.py # Phase 2: Blackhole TTNN UNet + PNDM scheduler
├── examples/
│ ├── generate_baseline.py # Phase 1 (CPU, any hardware)
│ └── generate_blackhole.py # Phase 2 (Blackhole hardware)
├── output/ # Generated GIFs land here
└── setup.py
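To confirm the copy worked, a small check like the one below can compare the scratchpad against the tree above. This is a sketch, not part of the package: `missing_files` is a hypothetical helper, and the file list is taken from the structure shown here.

```python
from pathlib import Path

# Files listed in the project structure above
EXPECTED_FILES = [
    "animatediff_ttnn/pipeline.py",
    "animatediff_ttnn/ttnn_pipeline.py",
    "examples/generate_baseline.py",
    "examples/generate_blackhole.py",
    "setup.py",
]


def missing_files(root):
    """Return the expected files that do not exist under `root` (hypothetical helper)."""
    root_path = Path(root).expanduser()
    return [rel for rel in EXPECTED_FILES if not (root_path / rel).is_file()]


if __name__ == "__main__":
    missing = missing_files("~/tt-scratchpad/tt-animatediff")
    print("All expected files present" if not missing else f"Missing: {missing}")
```

The output/ directory is left out of the check on purpose, since it may not exist until the first run.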
Step 2: Download the models
# SD 1.4 — required for both phases
hf download CompVis/stable-diffusion-v1-4
# AnimateDiff motion adapter — Phase 1 only
hf download guoyww/animatediff-motion-adapter-v1-5-2
Step 3: Phase 1 — CPU AnimateDiffPipeline
The diffusers AnimateDiffPipeline loads SD 1.4, injects the MotionAdapter at every transformer block, and denoises ALL frames simultaneously with temporal attention. This gives true frame-to-frame coherence from the latent-space denoising.
ls -lh ~/tt-scratchpad/tt-animatediff/output/ 2>/dev/null || echo "No output yet — run Phase 1 or Phase 2 first"
Expected output (output/phase1.gif):
Loading pipeline...
Generating frames: 100%|██████████| 25/25
Saved 8 frames → output/phase1.gif
Performance: ~2 min/frame on CPU, ~12–21 s/frame on N150/N300.
What happens inside:
from diffusers import AnimateDiffPipeline, MotionAdapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", motion_adapter=adapter)
# All 8 frames denoised together — temporal attention at every step
result = pipe(prompt=prompt, num_frames=8, num_inference_steps=25)
frames = result.frames[0] # List of PIL Images
Or run it directly from your scratchpad with custom prompts:
cd ~/tt-scratchpad/tt-animatediff
python examples/generate_baseline.py \
--prompt "purple phosphor glow across distant mountains at 2am, retro CRT haze, cinematic" \
--frames 8 --steps 25 --output output/phosphor_cpu.gif
Step 4: Phase 2 — Blackhole TTNN UNet
Phase 2 replaces the PyTorch UNet with the TTNN UNet from ~/tt-metal/models/demos/wormhole/stable_diffusion/. The denoising loop runs on Blackhole; latents are decoded with the CPU PyTorch VAE, because the TTNN VAE runs out of memory (OOM) on Blackhole's final conv_out due to an L1 grid mismatch in the Wormhole-targeted kernel (see Known limitations below).
Requires: Blackhole hardware (P100/P150/P300C/QB2) and ~/tt-metal built.
⚡ Run Phase 2 (Blackhole)
Expected output (output/blackhole.gif):
AnimateDiff Phase 2 — Blackhole TTNN UNet
Prompt : 1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, retro-futurist optimism, cinematic 4K
Frames : 8 Steps: 25
Opening Blackhole device...
Loading SD 1.4 models onto Blackhole...
Models loaded in 7.3s
Encoding prompts with CLIP...
Generating 8 frames on Blackhole...
Frame 1/8 done
...
Frame 8/8 done
Generated in 121.0s (15.1s/frame)
Saved 8 frames -> output/blackhole.gif
Performance on P300C: 7s model load + ~15s/frame. Kernel compilation ~2–3 min on first run (cached after).
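For planning longer clips, the figures above combine into a rough wall-clock estimate. The function below is only a sketch: the name `estimate_phase2_seconds` is hypothetical, the defaults are the rounded numbers quoted in this lesson, and the 150 s compile time is an assumed midpoint of the 2–3 min range.

```python
def estimate_phase2_seconds(frames, seconds_per_frame=15.0, load_seconds=7.0,
                            first_run=False, compile_seconds=150.0):
    """Rough wall-clock estimate for a Phase 2 run (illustrative, not measured)."""
    total = load_seconds + frames * seconds_per_frame
    if first_run:
        # Kernel compilation happens once and is cached afterwards
        total += compile_seconds
    return total


warm = estimate_phase2_seconds(8)
cold = estimate_phase2_seconds(8, first_run=True)
print(f"warm cache: {warm:.0f} s, first run: {cold:.0f} s")
```

Because frames are denoised sequentially, the estimate scales linearly with the frame count.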
Or run directly with custom prompts:
cd ~/tt-metal && source python_env/bin/activate
export TT_METAL_HOME=~/tt-metal TT_METAL_ARCH_NAME=blackhole
cd ~/tt-scratchpad/tt-animatediff
python examples/generate_blackhole.py \
--prompt "your prompt here" --frames 8 --steps 25
Step 5: View your output
📁 View Output Files
GIFs are in ~/tt-scratchpad/tt-animatediff/output/. Open them in any image viewer or browser.
How Phase 2 works
animatediff_ttnn/ttnn_pipeline.py — the Blackhole pipeline:
def generate_frames(device, ttnn_model, torch_vae, config, ttnn_scheduler, ...):
    for frame_idx in range(num_frames):
        # Reset PNDM scheduler state (counter, ets buffer) before each frame
        ttnn_scheduler.set_timesteps(num_steps)
        # Shared base noise + small per-frame perturbation = inter-frame coherence
        frame_noise = base_noise + 0.05 * torch.randn_like(base_noise)
        ttnn_latents = ttnn.from_torch(frame_noise, ...)
        # Full PNDM denoising loop (runs on Blackhole TTNN)
        for index in range(len(time_step)):
            latent_input = ttnn.concat([ttnn_latents, ttnn_latents], dim=0)  # CFG
            noise_pred = ttnn_model(latent_input, timestep=_tlist[index], ...)
            noise_pred = tt_guide(noise_pred, guidance_scale)
            ttnn_latents = ttnn_scheduler.step(noise_pred, t, ttnn_latents).prev_sample
        # Decode with PyTorch VAE on CPU (TTNN VAE OOMs on Blackhole conv_out)
        latents_cpu = ttnn.to_torch(ttnn_latents).float() / 0.18215
        decoded = torch_vae.decode(latents_cpu).sample
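The `tt_guide` call applies classifier-free guidance (CFG). Its TTNN internals are not shown in this lesson, so the sketch below uses the standard CFG formula on plain Python lists. The function name is hypothetical, and it assumes the doubled batch is split into an unconditional and a text-conditioned prediction; check ttnn_pipeline.py for the actual ordering.

```python
def classifier_free_guidance(noise_uncond, noise_text, guidance_scale):
    """Standard CFG: push the unconditional prediction toward the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5)
print(guided)
```

With a scale of 1.0 the result equals the text-conditioned prediction; larger scales exaggerate the difference between the two predictions.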
CLIP encoding uses the text encoder bundled inside SD 1.4 — no separate model download:
import torch
from transformers import CLIPTokenizer, CLIPTextModel
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder")
# embeds: text_encoder output for the prompt, shape (1, 77, 768)
# Pad 77 → 96 tokens: TTNN UNet expects 96-token sequences
embeds = torch.nn.functional.pad(embeds, (0, 0, 0, 19)) # (1, 96, 768)
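`torch.nn.functional.pad` reads its pad tuple starting from the last dimension, so `(0, 0, 0, 19)` leaves the 768 channels untouched and appends 19 zero rows to the token dimension. A list-based sketch of the same operation, with a hypothetical helper name, makes the shapes explicit:

```python
def pad_tokens(embeds, target_len=96):
    """Append zero vectors so each sequence has `target_len` tokens.

    `embeds` is (batch, tokens, channels) as nested lists; hypothetical helper
    mirroring torch.nn.functional.pad(embeds, (0, 0, 0, target_len - tokens)).
    """
    padded = []
    for seq in embeds:
        if len(seq) > target_len:
            raise ValueError("sequence is longer than target_len")
        channels = len(seq[0])
        zeros = [[0.0] * channels for _ in range(target_len - len(seq))]
        padded.append(list(seq) + zeros)
    return padded


# CLIP yields 77 tokens of 768 channels; 96 - 77 = 19 zero rows are appended
embeds = [[[1.0] * 768 for _ in range(77)]]
out = pad_tokens(embeds)
print(len(out), len(out[0]), len(out[0][0]))
```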
Prompt tips
SD 1.4 responds well to photography-style prompts:
| Goal | Prompt |
|---|---|
| Retro-futurist city | "1939 World's Fair imagined from the year 2099, art deco spires at golden dusk, cinematic 4K" |
| Phosphor landscape | "purple phosphor glow across distant mountains at 2am, retro CRT haze, cyan mist, cinematic" |
| Night sky | "starry night sky over mountains, long exposure, 4K" |
| Abstract | "colorful aurora borealis, northern lights, long exposure" |
Tuning coherence: The 0.05 noise perturbation in ttnn_pipeline.py controls frame variation. Edit ~/tt-scratchpad/tt-animatediff/animatediff_ttnn/ttnn_pipeline.py to adjust — higher values give more frame-to-frame motion.
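To get a feel for the perturbation scale: if the base noise and each per-frame perturbation are independent unit Gaussians, the starting noise of any two frames has a per-element correlation of 1 / (1 + s²), where s is the perturbation scale. This is a back-of-the-envelope model, not something measured on the pipeline, and the helper names are hypothetical.

```python
import random


def frame_noise_correlation(scale):
    """Correlation between base + scale*e1 and base + scale*e2 (independent unit Gaussians)."""
    return 1.0 / (1.0 + scale ** 2)


def empirical_correlation(scale, n=50_000, seed=0):
    """Monte Carlo check of the closed-form value above."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        base = rng.gauss(0.0, 1.0)
        xs.append(base + scale * rng.gauss(0.0, 1.0))
        ys.append(base + scale * rng.gauss(0.0, 1.0))
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


for scale in (0.05, 0.2, 0.5):
    print(scale, round(frame_noise_correlation(scale), 4), round(empirical_correlation(scale), 4))
```

Under this model, 0.05 keeps the starting noise of two frames about 99.75% correlated, while 0.5 lowers that to 0.8, which is one way to reason about how far to raise the value.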
The model bring-up methodology
This project demonstrates a complete workflow for integrating a new model:
Phase 1: Research (1–2 hours)
- Clone the reference implementation (guoyww/AnimateDiff)
- Download model weights to understand their structure
- Read the core architecture files (not just papers — code reveals reality)
- Document key patterns: reshaping logic, injection points, weight keys
Phase 2: Design (30–60 min)
- Choose standalone vs integrated approach — standalone wins for maintainability
- Create project structure outside ~/tt-metal/ (your code, your ownership)
- Define the API surface
- Identify the TT-Metal integration boundary
Phase 3: Implementation (2–4 hours)
- Start with PyTorch — easier to debug, matches reference
- Implement the core module first
- Build the high-level wrapper second
- Add the TTNN path last, after PyTorch is validated
Phase 4: Packaging (1 hour)
- setup.py + requirements.txt make it installable with pip install -e .
- Example scripts show exactly how to run it
- README documents what works and what doesn't
Phase 5: Validate on hardware (1–2 hours)
- First run: expect kernel compilation ~2–3 min — that's normal
- Verify output shapes match expected latent dimensions
- Check VAE decode produces recognizable images
- Benchmark: frames/second, memory usage
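For the shape check in Phase 5, the expected latent dimensions can be computed up front. The sketch below assumes an SD 1.x-style VAE (4 latent channels, 8× spatial downsampling), which matches the 64×64 latents mentioned earlier for 512×512 images; the helper name is hypothetical.

```python
def expected_latent_shape(height, width, batch=1, latent_channels=4, vae_scale=8):
    """Latent shape for an SD 1.x-style VAE (4 channels, 8x spatial downsampling)."""
    if height % vae_scale or width % vae_scale:
        raise ValueError("height and width must be multiples of the VAE scale factor")
    return (batch, latent_channels, height // vae_scale, width // vae_scale)


shape = expected_latent_shape(512, 512)
print(shape)
```

Comparing this tuple against the shape of the latents coming off the device catches layout mistakes before the VAE decode.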
Total for a complete new model: roughly 6–10 hours (the phase estimates above sum to 5.5–10). This is the path from demos to real applications.
Known limitations
| Issue | Status |
|---|---|
| TTNN VAE OOMs on Blackhole conv_out | Workaround: CPU PyTorch VAE decode |
| No TemporalTransformer in TTNN UNet | Phase 1 only; Phase 2 uses shared-noise coherence |
| DispatchCoreAxis.ROW crashes on Blackhole | Avoided: setup_blackhole() uses auto-detect |
| First run 2–3 min kernel compilation | Expected; cached after first run |
What's next
Add temporal attention to the TTNN UNet
Full Phase 2 would inject TemporalTransformer blocks into the TTNN UNet's BasicTransformerBlock instances — the same injection pattern as Phase 1's MotionAdapter, but in TTNN ops. This would bring true AnimateDiff temporal coherence to Blackhole-accelerated generation.
Apply this pattern to other models
The standalone package pattern works for any model:
- ControlNet — conditioning inputs for SD 1.4
- LoRA — weight delta injection into the SD UNet
- IP-Adapter — image-conditioned generation
- Any PyTorch model — wrap it, validate on CPU, port to TTNN

