Image Generation with Stable Diffusion XL
Generate images on your Tenstorrent hardware using Stable Diffusion XL Base - turn text prompts into high-resolution 1024x1024 images!
What is Stable Diffusion XL?
Stable Diffusion XL Base is a powerful text-to-image diffusion model that generates high-quality 1024x1024 images from text descriptions. SDXL uses a two-stage architecture with dual text encoders (CLIP-L and OpenCLIP-G) for improved prompt understanding.
Why Image Generation on Tenstorrent?
- 🎨 Native TT Acceleration - Runs directly on Tenstorrent hardware using tt-metal
- 🔒 Privacy - Your prompts and images stay private
- ⚡ High Resolution - Generate 1024x1024 images (vs 512x512 in older models)
- 🎓 Production Ready - Real hardware acceleration, not CPU fallback
Journey So Far
- Lesson 3: Text generation with Llama
- Lesson 4-5: Chat and API servers
- Lesson 6-7: Production deployment with vLLM
- Lesson 8: Image generation ← You are here
Architecture
Stable Diffusion XL uses a two-stage architecture with dual text encoders:
┌──────────────────────────────────────┐
│             Text Prompt              │
│   "If Tenstorrent were a company     │
│    in the 1960s and 1970s"           │
└──────────────────┬───────────────────┘
                   │
                   ▼
┌────────────────────────────┐
│  Dual Text Encoders:       │  ← Encode text to embeddings
│  • CLIP-L (OpenAI)         │    (pooled + sequence)
│  • OpenCLIP-G (LAION)      │
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│  UNet Diffusion Model      │  ← Generate latent representation
│  Running on TT Hardware    │    (28-50 denoising steps)
│  Cross-attention layers    │
└──────────────┬─────────────┘
               │
               ▼
┌────────────────┐
│  VAE Decoder   │  ← Convert latents to 1024x1024 pixels
└────────┬───────┘
         │
         ▼
┌────────────────┐
│   Generated    │
│  Image (PNG)   │
└────────────────┘
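These stages map one-to-one onto modules of the Hugging Face diffusers pipeline, so you can inspect them directly. A minimal sketch using the diffusers reference implementation (not the tt-metal port):

```python
import torch
from diffusers import DiffusionPipeline

# Load the SDXL base pipeline (downloads the weights on first run)
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float32,
    use_safetensors=True,
)

# Each box in the diagram is a separate module on the pipeline:
print(type(pipe.text_encoder).__name__)    # CLIP-L text encoder
print(type(pipe.text_encoder_2).__name__)  # OpenCLIP-G text encoder
print(type(pipe.unet).__name__)            # UNet diffusion model
print(type(pipe.vae).__name__)             # VAE decoder
```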
Hardware Compatibility
Stable Diffusion XL Base runs on Tenstorrent hardware with native TT-NN acceleration (not CPU fallback!):
| Hardware | Status | Performance | Notes |
|---|---|---|---|
| N150 (Wormhole) | ✅ Supported | ~12-15 sec/image | Optimized single-chip config |
| N300 (Wormhole) | ✅ Supported | ~8-10 sec/image | Faster with 2 chips |
| P100 (Blackhole) | ⚠️ Experimental | ~12-15 sec/image | Same Blackhole arch as P300c |
| P300c (Blackhole) | ⚠️ Experimental | ~12-15 sec/image | Single Blackhole chip; use MESH_DEVICE=P100 |
| T3K (Wormhole) | ✅ Supported | ~5-8 sec/image | Production scale (8 chips) |
All hardware benefits from native TT-NN acceleration! The model runs directly on Tensix cores using hardware-specific operators.
Check Your Hardware
Quick Check: Not sure which hardware you have?
tt-smi
Look for the "Board Type" field in the output (e.g., n150, n300, t3k, p100).
Prerequisites
- tt-metal installed and working (completed Lesson 2)
- Hugging Face account (for automatic model download)
- Tenstorrent hardware (see compatibility table above)
- ~10-15 GB disk space for model weights
Model: Stable Diffusion XL Base
We'll use Stable Diffusion XL Base 1.0, which runs natively on Tenstorrent hardware using tt-metal.
Model Details:
- HuggingFace Model: stabilityai/stable-diffusion-xl-base-1.0
- Size: ~10 GB
- Resolution: 1024x1024 images (high quality!)
- Speed: ~12-15 seconds per image on N150 (varies by hardware)
- Architecture: UNet with dual text encoders (CLIP-L + OpenCLIP-G)
- Inference Steps: 28-50 (configurable)
- Hardware: Runs on TT-NN operators (native acceleration)
✨ v0.65.1 Improvements:
- Faster VAE decoding - Optimized latent-to-pixel conversion
- Better encoder performance - Dual text encoders run more efficiently
- Combined base+refiner - New two-stage pipeline for best quality
These improvements make SDXL even faster and better on your hardware!
💡 Lighter Alternative: For faster iteration or testing, Stable Diffusion v1.4 is also available (models/demos/wormhole/stable_diffusion/) and generates 512×512 images in ~8-10 seconds on N150. Great for development!
Step 1: Authenticate with Hugging Face
The model will be downloaded automatically from Hugging Face the first time you run it. Log in to enable the download:
hf auth login --token "$HF_TOKEN"
Note: SDXL Base 1.0 is publicly available and doesn't require special access permissions.
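If you prefer to verify the login from Python, huggingface_hub exposes the same check (a small sketch; assumes huggingface_hub is installed, which the diffusers stack pulls in):

```python
from huggingface_hub import whoami

# Prints your account info if the token is valid, raises an error otherwise
info = whoami()
print(f"Logged in as: {info['name']}")
```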
Step 2: Configure for Your Hardware
Set the appropriate mesh device environment variable for your hardware:
🔧 N150 (Wormhole - Single Chip) - Most common
export MESH_DEVICE=N150
Performance: ~12-15 seconds per 1024x1024 image
🔧 N300 (Wormhole - Dual Chip)
export MESH_DEVICE=N300
Performance: ~8-10 seconds per 1024x1024 image (faster with 2 chips!)
🔧 T3K (Wormhole - 8 Chips)
export MESH_DEVICE=T3K
Performance: ~5-8 seconds per 1024x1024 image (production speed!)
🔧 P100 (Blackhole - Single Chip)
export MESH_DEVICE=P100
export TT_METAL_ARCH_NAME=blackhole # Required for Blackhole
Performance: ~12-15 seconds per 1024x1024 image (similar to N150)
⚠️ Note: Blackhole SDXL support is experimental. Please report any issues!
🔧 P300c (Blackhole - Single Chip / QB2)
export MESH_DEVICE=P100 # P300c runs in single-chip P100 mode
export TT_METAL_ARCH_NAME=blackhole
Performance: ~12-15 seconds per 1024x1024 image
P300c is a single Blackhole chip — identical instruction set to P100.
Use MESH_DEVICE=P100 for all single-chip Blackhole lessons.
QB2 note: QB2 ships without ~/tt-metal. You must clone and build
tt-metal from source before running SDXL. See
Build tt-metal from Source.
⚠️ Note: Blackhole SDXL support is experimental. Please report any issues!
What this does:
- Tells tt-metal to configure for your specific hardware
- Optimizes model parallelization for your chip count
- Enables appropriate memory management
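Before launching a long run, you can sanity-check the variables from Python (a trivial sketch, nothing tt-metal specific):

```python
import os

# Confirm the hardware configuration the demos will pick up
for var in ("MESH_DEVICE", "TT_METAL_ARCH_NAME"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")
```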
Step 3: Generate Your First Image
Run the Stable Diffusion XL demo with a sample prompt (using the MESH_DEVICE you set in Step 2):
mkdir -p ~/tt-scratchpad
cd ~/tt-scratchpad
export PYTHONPATH=~/tt-metal:$PYTHONPATH
# Use the MESH_DEVICE you set in Step 2 (N150, N300, T3K, or P100)
# Run with default prompt
pytest ~/tt-metal/models/experimental/stable_diffusion_xl_base/demo/demo.py
What you'll see:
Loading Stable Diffusion XL Base from stabilityai...
✓ Model loaded from stabilityai/stable-diffusion-xl-base-1.0
✓ Initializing UNet on TT hardware
✓ Encoders loaded (CLIP-L + OpenCLIP-G)
Generating 1024x1024 image (28-50 inference steps)...
Processing... (first generation takes longer - model compilation + warmup)
Decoding with VAE...
✓ Image generation complete!
✓ Image saved to: output directory
Generation time: [varies by hardware - see Step 2 performance notes]
The generated image will be saved according to the test configuration.
Step 4: Interactive Mode - Try Your Own Prompts
Run in interactive mode to generate multiple images with custom prompts (using your MESH_DEVICE from Step 2):
mkdir -p ~/tt-scratchpad
cd ~/tt-scratchpad
export PYTHONPATH=~/tt-metal:$PYTHONPATH
# Use the MESH_DEVICE you set in Step 2
# Run interactive mode
pytest ~/tt-metal/models/experimental/stable_diffusion_xl_base/demo/demo.py
Note: The current demo.py uses pytest configuration. For a more interactive experience, see the "Create Your Own Demo" section below.
Example prompts to try:
Literary & Cultural References
- Steinbeck's Computing Dust Bowl:
"The Grapes of Wrath reimagined as 1970s computer lab, orange terminals, dusty atmosphere, vintage photograph, film grain"
- Kerouac's Electric Highway:
"On the Road meets Silicon Valley, beat generation aesthetic, vintage mainframe computers, dharma bums coding, 1960s photography"
- Gertrude Stein's Repetition Machine:
"A rose is a rose is a processor, cubist computing, abstract geometric circuit boards, modernist aesthetic, orange and purple"
- Whole Earth Catalog Computer Lab:
"1970s alternative technology workshop, homebrew computer club, Stewart Brand aesthetic, orange and brown, democratic tools, vintage catalog photography"
Classic Movie Computing Quotes
- Chocolate-Powered AI:
"What would a computer do with a lifetime supply of chocolate? Willy Wonka meets mainframe, whimsical vintage computing, 1970s aesthetic, orange accents"
- WarGames WOPR:
"Would you like to play a game? Cold War computing aesthetic, NORAD command center, green phosphor terminals, dramatic lighting, 1980s photography"
Decidedly Tenstorrent
- Tensix Mandelbrot Dreams:
"880 RISC-V cores dreaming of fractals, purple and orange silicon wafer, crystalline structure, technical diagram meets abstract art"
- Orange Silicon Valley:
"AI accelerator as California poppy field, orange blooms, Tenstorrent hardware, golden hour lighting, Stanford Foothills, technical beauty"
- Network-on-Chip Landscape:
"NoC topology as ancient trade routes, silicon pathways, orange and purple, cartography meets chip design, vintage map aesthetic"
- The Tensor Processing Saloon:
"Wild West saloon but it's a 1970s computer lab, orange terminals, cowboys coding RISC-V assembly, vintage Americana, film photograph"
Example Output
Here's what you can create with Stable Diffusion XL on Tenstorrent hardware:

Generated with prompt: "A cozy cabin in a snowy forest, warm lights in windows, winter evening, oil painting style"
Generation details:
- Resolution: 1024x1024
- Steps: 28
- Hardware: N150 (single Wormhole chip)
- Time: ~2-3 minutes (first run includes model load)
Step 5: Create Your Own Interactive Demo (Advanced)
Want a simpler, more interactive experience? The pytest-based demo is powerful but complex. You can create a simplified demo script:
# ~/tt-scratchpad/simple_sdxl_demo.py
import torch
from diffusers import DiffusionPipeline

# Load the SDXL base pipeline (downloads ~10 GB on first run).
# Note: this plain diffusers pipeline runs on CPU, not on TT hardware.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float32,
    use_safetensors=True,
)

# Generate an image from a user-supplied prompt
prompt = input("Enter your prompt: ")
image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=7.5,
).images[0]

# Save the result
output_path = "sdxl_output.png"
image.save(output_path)
print(f"✅ Image saved to: {output_path}")
This is a simpler starting point that you can customize further!
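Run it with python ~/tt-scratchpad/simple_sdxl_demo.py. Because it uses the plain diffusers pipeline on CPU, it will be much slower than the tt-metal demo - treat it as a scaffold for prompt handling and file output that you can customize.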
Step 5.5: Combined Base + Refiner (NEW in v0.65.x! 🎨)
Want even BETTER image quality? SDXL has a two-stage architecture: Base generates the image, Refiner enhances it!
The combined pipeline runs both stages automatically:
cd ~/tt-scratchpad
export PYTHONPATH=~/tt-metal:$PYTHONPATH
# Use your MESH_DEVICE from Step 2
# Run combined base + refiner pipeline
pytest ~/tt-metal/models/experimental/stable_diffusion_xl_base/demo/demo_base_and_refiner.py
What happens:
- Base model generates 1024x1024 image
- Refiner model enhances details, colors, and quality
- Result: Noticeably better quality than base alone!
Performance:
- Takes about 2x longer than base-only (~25-30 sec on N150)
- Worth it for final/production images
- Use base-only for quick iteration, refiner for finals
When to use combined pipeline:
- ✅ Final production images
- ✅ When quality matters most
- ✅ Professional/commercial work
- ❌ Quick experimentation (stick with base-only)
Tip: Generate with base-only while developing your prompt, then run combined pipeline on your best results!
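For reference, the same two-stage idea can be sketched with Hugging Face diffusers (this is the diffusers reference pattern, not the tt-metal pipeline; stabilityai/stable-diffusion-xl-refiner-1.0 is the public refiner checkpoint):

```python
from diffusers import DiffusionPipeline

# Base generates latents; the refiner polishes them and decodes to pixels.
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", use_safetensors=True
)
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,  # share modules to save memory
    vae=base.vae,
    use_safetensors=True,
)

prompt = "A cozy cabin in a snowy forest, oil painting style"
# Base handles the first 80% of the denoising schedule and returns latents...
latents = base(
    prompt=prompt, num_inference_steps=28,
    denoising_end=0.8, output_type="latent",
).images
# ...the refiner finishes the last 20% and decodes to a 1024x1024 image.
image = refiner(
    prompt=prompt, num_inference_steps=28,
    denoising_start=0.8, image=latents,
).images[0]
image.save("base_plus_refiner.png")
```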
Step 6: Experiment with Code (Advanced)
Ready to go beyond button-pressing? Copy the demo to your scratchpad and modify it:
Copy the demo and open it for editing:
cp ~/tt-metal/models/experimental/stable_diffusion_xl_base/demo/demo.py ~/tt-scratchpad/sdxl_demo.py
What you can experiment with:
- Batch generation with variations:
import torch

# Generate multiple images with seed variations
# (`pipe` is a loaded DiffusionPipeline, like `pipeline` in the Step 5 script)
prompts = [
    "Whole Earth Catalog computer lab, 1970s",
    "Kerouac typing on vintage terminal, beat aesthetic",
    "Would you like to play a game? WOPR terminal",
]
for i, prompt in enumerate(prompts):
    image = pipe(
        prompt=prompt,
        num_inference_steps=28,
        guidance_scale=3.5,
        generator=torch.Generator().manual_seed(i),  # different seed per image
    ).images[0]
    image.save(f"tenstorrent_{i:03d}.png")
- Parameter exploration:
# Try different guidance scales to see their impact on prompt adherence
for scale in [2.0, 3.5, 5.0, 7.5]:
    image = pipe(
        prompt="Tenstorrent headquarters, orange architecture",
        guidance_scale=scale,
        generator=torch.Generator().manual_seed(0),  # fixed seed isolates the effect
    ).images[0]
    image.save(f"guidance_{scale}.png")
- Prompt interpolation:
# Blend between two concepts
prompts = [
    "1960s mainframe computer room",
    "futuristic AI accelerator lab",
]
# Generate with weighted combination
- Custom resolution experiments:
# Try different aspect ratios (dimensions must stay divisible by 8)
image = pipe(
    prompt="Wide cinematic shot of vintage computing",
    width=1536,   # 16:9 aspect ratio
    height=864,
).images[0]
image.save("cinematic_1536x864.png")
Tips for code experiments:
- Model stays loaded between generations (fast iterations!)
- Save images with descriptive names: prompt_seed_guidance.png
- Keep num_inference_steps=28 (optimized for SDXL)
- Experiment with guidance_scale between 2.0 and 7.5
- Use seeds for reproducibility (same seed = same image)
Make it your own! The demo is just a starting point - modify, extend, and create your own image generation workflows.
Understanding the Generation Process
Diffusion Process in SDXL:
- Text Encoding - Dual encoders (CLIP-L + OpenCLIP-G) process your prompt into embeddings
- Start with noise - Begin with random latent representation in 128x128 latent space
- Denoise iteratively - UNet removes noise in 28-50 steps guided by text embeddings
- Each step runs on TT hardware - Native TT-NN acceleration on Tensix cores
- VAE Decoding - Convert 128x128 latents to 1024x1024 pixel image (8x upscaling)
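To make the loop concrete, here is a shape-level schematic in plain PyTorch (a toy stand-in: random tensors instead of the real UNet and scheduler, just to show the data flow):

```python
import torch

latents = torch.randn(1, 4, 128, 128)       # step 2: random 128x128 latent
text_embeddings = torch.randn(1, 77, 2048)  # step 1: output of the dual encoders

def toy_unet(latents, timestep, encoder_hidden_states):
    # Stand-in for the real UNet (which runs on TT hardware)
    return torch.randn_like(latents)

num_inference_steps = 28
for t in reversed(range(num_inference_steps)):  # step 3: denoise iteratively
    noise_pred = toy_unet(latents, t, text_embeddings)
    latents = latents - noise_pred / num_inference_steps  # simplified scheduler update

print(latents.shape)  # step 5: the VAE decodes these latents to 1024x1024 pixels
```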
Key Parameters:
num_inference_steps (28-50)
- Number of denoising steps
- 28: Faster generation (~12-15 sec)
- 50: Higher quality but slower (~20-25 sec)
- Configurable via pytest parameters
guidance_scale (7.5)
- How closely to follow your prompt
- 7.5: Standard default for SDXL Base
- Higher values = more literal interpretation
- Lower values = more creative freedom
image_w, image_h (1024x1024)
- High resolution output
- Can be adjusted but 1024x1024 is optimal for SDXL
seed (0)
- Random seed for reproducibility
- Same seed + same prompt = same image
- Useful for iterating on prompts
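Here is how those parameters fit together in a single diffusers call (a sketch; the tt-metal demo exposes the same knobs through its pytest configuration):

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", use_safetensors=True
)

# Same seed + same prompt + same parameters => the same image
image = pipe(
    prompt="Vintage 1970s office, warm lighting, film photograph",
    num_inference_steps=28,                      # 28 = faster, 50 = higher quality
    guidance_scale=7.5,                          # prompt adherence
    width=1024, height=1024,                     # optimal SDXL resolution
    generator=torch.Generator().manual_seed(0),  # reproducibility
).images[0]
image.save("seed0.png")
```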
Prompt Engineering Tips
Good prompts include:
- Subject - What you want to see
- Style - Art style, photography type
- Colors - Color scheme
- Lighting - Lighting conditions
- Details - Specific details to include
Example:
"Vintage 1970s office, orange and brown color scheme, retro computers,
warm lighting, film photograph, detailed, high quality"
Keywords that work well:
- Art styles: photorealistic, digital art, oil painting, sketch
- Quality: detailed, high quality, 8k, professional
- Lighting: studio lighting, natural light, dramatic lighting
- Camera: 35mm photograph, wide angle, close-up
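One way to keep prompts consistent while you iterate is a tiny helper that assembles them from those ingredients (a hypothetical convenience function, not part of any library):

```python
def build_prompt(subject, style, colors, lighting, details):
    """Assemble a prompt from the five ingredients above."""
    return ", ".join([subject, style, colors, lighting, details])

prompt = build_prompt(
    subject="Vintage 1970s office with retro computers",
    style="film photograph",
    colors="orange and brown color scheme",
    lighting="warm lighting",
    details="detailed, high quality",
)
print(prompt)
```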
Performance Optimization
For faster generation on N150:
- Reduce steps: use num_inference_steps=28 instead of 50
- Lower resolution: e.g. 768x768 instead of 1024x1024 (keep dimensions divisible by 8)
- Use attention slicing: the script automatically enables this on N150 to reduce memory usage
Comparing Generation Speed
| Hardware | Steps | Time | Notes |
|---|---|---|---|
| CPU Only | 50 | ~5-10 min | Very slow |
| N150 | 50 | ~15-30 sec | Accelerated |
| N300 | 50 | ~10-20 sec | Faster (2 chips) |
| High-end GPU | 50 | ~5-10 sec | Comparison |
Troubleshooting
Device reset between models (optional):
If you experience issues after running other models (like Llama from earlier lessons), you can reset the device:
tt-smi -r
This clears device state and memory. Usually not needed between pytest demos, but useful if:
- Previous demo crashed or hung
- You see "out of memory" or device errors
- Device behaves unexpectedly
- Switching between very different workloads
Most pytest tests automatically clean up the device, so this is only needed if something went wrong.
Model download fails:
# Check Hugging Face authentication
hf auth whoami
# SDXL Base 1.0 is publicly available - no special access needed
# Visit: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
Slow first generation:
- First run downloads the model (~10 GB) which takes 5-10 minutes
- First generation loads model into device (2-5 min)
- Subsequent generations are much faster (~12-15 sec)
- This is normal behavior
Device hangs or crashes:
# Reset the device
tt-smi -r
# If that doesn't work, clear device state completely
sudo rm -rf /dev/shm/tenstorrent* /dev/shm/tt_*
tt-smi -r
What You Learned
- ✅ How to set up Stable Diffusion on Tenstorrent hardware
- ✅ Text-to-image generation with custom prompts
- ✅ Understanding diffusion model parameters
- ✅ Prompt engineering for better results
- ✅ Batch generation and optimization
Key takeaway: You can generate high-quality images locally on your Tenstorrent hardware, with full control over the generation process and complete privacy.
Next Steps
Experiment with:
- Different prompts - Try various subjects and styles
- Parameter tuning - Adjust steps, guidance_scale, and seed
- Batch generation - Create variations of successful prompts
- Image-to-image - Use generated images as starting points (advanced)
Advanced topics:
- Fine-tuning Stable Diffusion on custom images
- Inpainting (editing parts of images)
- ControlNet for precise control
- Integrating with web interfaces
Resources
- Stable Diffusion: stability.ai
- Hugging Face Diffusers: huggingface.co/docs/diffusers
- Prompt Engineering Guide: prompthero.com
- TT-Metal Docs: docs.tenstorrent.com
Happy generating! 🎨