N150 · N300 · T3K · P100 · P300c · 20 min · Validated

Image Generation with Stable Diffusion XL

Generate images on your Tenstorrent hardware using Stable Diffusion XL Base: turn text prompts into high-resolution 1024x1024 images, all generated locally on your own accelerator!

What is Stable Diffusion XL?

Stable Diffusion XL Base is a powerful text-to-image diffusion model that generates high-quality 1024x1024 images from text descriptions. SDXL uses a two-stage architecture with dual text encoders (CLIP-L and OpenCLIP-G) for improved prompt understanding.

Why Image Generation on Tenstorrent?

Journey So Far

Architecture

Stable Diffusion XL's base pipeline chains dual text encoders, a UNet diffusion model, and a VAE decoder:

┌──────────────────────────────────────┐
│     Text Prompt                      │
│  "If Tenstorrent were a company      │
│   in the 1960s and 1970s"            │
└─────────────┬────────────────────────┘
              │
              ▼
     ┌────────────────────────────┐
     │ Dual Text Encoders:        │
     │ • CLIP-L (OpenAI)          │ ← Encode text to embeddings
     │ • OpenCLIP-G (laion)       │   (pooled + sequence)
     └────────┬───────────────────┘
              │
              ▼
     ┌────────────────────────────┐
     │ UNet Diffusion Model       │ ← Generate latent representation
     │ Running on TT Hardware     │    (28-50 denoising steps)
     │ Cross-attention layers     │
     └────────┬───────────────────┘
              │
              ▼
     ┌────────────────┐
     │ VAE Decoder    │ ← Convert latents to 1024x1024 pixels
     └────────┬───────┘
              │
              ▼
     ┌────────────────┐
     │ Generated      │
     │ Image (PNG)    │
     └────────────────┘
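If you want to see how these stages map to concrete objects, the Hugging Face diffusers reference pipeline exposes each one as an attribute. A minimal sketch (this loads the standard PyTorch pipeline for inspection, not the TT-NN implementation):

```python
from diffusers import DiffusionPipeline

# Load the reference SDXL Base pipeline (weights download on first run)
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", use_safetensors=True
)

# The components correspond to the stages in the diagram above
print(type(pipe.text_encoder).__name__)    # CLIP-L text encoder
print(type(pipe.text_encoder_2).__name__)  # OpenCLIP-G text encoder
print(type(pipe.unet).__name__)            # UNet diffusion model
print(type(pipe.vae).__name__)             # VAE decoder (latents -> pixels)
```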

Hardware Compatibility

Stable Diffusion XL Base runs on Tenstorrent hardware with native TT-NN acceleration (not CPU fallback!):

| Hardware | Status | Performance | Notes |
|---|---|---|---|
| N150 (Wormhole) | ✅ Supported | ~12-15 sec/image | Optimized single-chip config |
| N300 (Wormhole) | ✅ Supported | ~8-10 sec/image | Faster with 2 chips |
| P100 (Blackhole) | ⚠️ Experimental | ~12-15 sec/image | Same Blackhole arch as P300c |
| P300c (Blackhole) | ⚠️ Experimental | ~12-15 sec/image | Single Blackhole chip; use MESH_DEVICE=P100 |
| T3K (Wormhole) | ✅ Supported | ~5-8 sec/image | Production scale (8 chips) |

All hardware benefits from native TT-NN acceleration! The model runs directly on Tensix cores using hardware-specific operators.

Check Your Hardware

Quick Check: Not sure which hardware you have?

🔍 Detect Hardware
tt-smi

Look for the "Board Type" field in the output (e.g., n150, n300, t3k, p100).


Prerequisites


Model: Stable Diffusion XL Base

We'll use Stable Diffusion XL Base 1.0 which runs natively on Tenstorrent hardware using tt-metal.

Model Details:

✨ v0.65.1 Improvements:

💡 Lighter Alternative: For faster iteration or testing, Stable Diffusion v1.4 is also available (models/demos/wormhole/stable_diffusion/) and generates 512×512 images in ~8-10 seconds on N150. Great for development!

Step 1: Authenticate with Hugging Face

The model will be automatically downloaded from Hugging Face the first time you run it. Log in to enable downloading:


🔐 Login to Hugging Face
hf auth login --token "$HF_TOKEN"

Note: SDXL Base 1.0 is publicly available and doesn't require special access permissions.
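If you prefer to authenticate from Python instead of the CLI (for example inside a notebook), the huggingface_hub package offers equivalent calls. A minimal sketch, assuming HF_TOKEN is set in your environment:

```python
import os
from huggingface_hub import login, whoami

# Log in with the same token used by `hf auth login`
login(token=os.environ["HF_TOKEN"])

# Confirm which account is authenticated
print(whoami()["name"])
```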

Step 2: Configure for Your Hardware

Set the appropriate mesh device environment variable for your hardware:

🔧 N150 (Wormhole - Single Chip) - Most common
export MESH_DEVICE=N150

Performance: ~12-15 seconds per 1024x1024 image

🔧 N300 (Wormhole - Dual Chip)
export MESH_DEVICE=N300

Performance: ~8-10 seconds per 1024x1024 image (faster with 2 chips!)

🔧 T3K (Wormhole - 8 Chips)
export MESH_DEVICE=T3K

Performance: ~5-8 seconds per 1024x1024 image (production speed!)

🔧 P100 (Blackhole - Single Chip)
export MESH_DEVICE=P100
export TT_METAL_ARCH_NAME=blackhole  # Required for Blackhole

Performance: ~12-15 seconds per 1024x1024 image (similar to N150)

⚠️ Note: Blackhole SDXL support is experimental. Please report any issues!

🔧 P300c (Blackhole - Single Chip / QB2)
export MESH_DEVICE=P100          # P300c runs in single-chip P100 mode
export TT_METAL_ARCH_NAME=blackhole

Performance: ~12-15 seconds per 1024x1024 image

P300c is a single Blackhole chip — identical instruction set to P100. Use MESH_DEVICE=P100 for all single-chip Blackhole lessons.

QB2 note: QB2 ships without ~/tt-metal. You must clone and build tt-metal from source before running SDXL. See Build tt-metal from Source.

⚠️ Note: Blackhole SDXL support is experimental. Please report any issues!


What this does: MESH_DEVICE tells the tt-metal demo which device mesh configuration to target (single chip, dual chip, or 8-chip), and TT_METAL_ARCH_NAME selects the Blackhole architecture build on P100/P300c.
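As a quick sanity check before launching the demo, you can confirm the variables are visible to Python. A minimal sketch (not part of the demo itself; it assumes you exported the variables in the same shell):

```python
import os

# Verify the hardware configuration exported in Step 2
mesh = os.environ.get("MESH_DEVICE")
if mesh is None:
    raise SystemExit("Set MESH_DEVICE first (N150, N300, T3K, or P100)")
if mesh == "P100" and os.environ.get("TT_METAL_ARCH_NAME") != "blackhole":
    print("Hint: Blackhole boards also need TT_METAL_ARCH_NAME=blackhole")
print(f"MESH_DEVICE={mesh} is set; ready to run the SDXL demo")
```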

Step 3: Generate Your First Image

Run the Stable Diffusion XL demo with a sample prompt (using the MESH_DEVICE you set in Step 2):

mkdir -p ~/tt-scratchpad
cd ~/tt-scratchpad
export PYTHONPATH=~/tt-metal:$PYTHONPATH
# Use the MESH_DEVICE you set in Step 2 (N150, N300, T3K, or P100)

# Run with default prompt
pytest ~/tt-metal/models/experimental/stable_diffusion_xl_base/demo/demo.py

🎨 Generate Sample Image

What you'll see:

Loading Stable Diffusion XL Base from stabilityai...
✓ Model loaded from stabilityai/stable-diffusion-xl-base-1.0
✓ Initializing UNet on TT hardware
✓ Encoders loaded (CLIP-L + OpenCLIP-G)

Generating 1024x1024 image (28-50 inference steps)...
Processing... (first generation takes longer - model compilation + warmup)
Decoding with VAE...

✓ Image generation complete!
✓ Image saved to: output directory
Generation time: [varies by hardware - see Step 2 performance notes]

The generated image will be saved according to the test configuration.

Step 4: Interactive Mode - Try Your Own Prompts

Run in interactive mode to generate multiple images with custom prompts (using your MESH_DEVICE from Step 2):

mkdir -p ~/tt-scratchpad
cd ~/tt-scratchpad
export PYTHONPATH=~/tt-metal:$PYTHONPATH
# Use the MESH_DEVICE you set in Step 2

# Run interactive mode
pytest ~/tt-metal/models/experimental/stable_diffusion_xl_base/demo/demo.py

Note: The current demo.py uses pytest configuration. For a more interactive experience, see the "Create Your Own Demo" section below.

Example prompts to try:

Literary & Cultural References

  1. Steinbeck's Computing Dust Bowl:
   "The Grapes of Wrath reimagined as 1970s computer lab, orange terminals, dusty atmosphere, vintage photograph, film grain"
  2. Kerouac's Electric Highway:
   "On the Road meets Silicon Valley, beat generation aesthetic, vintage mainframe computers, dharma bums coding, 1960s photography"
  3. Gertrude Stein's Repetition Machine:
   "A rose is a rose is a processor, cubist computing, abstract geometric circuit boards, modernist aesthetic, orange and purple"
  4. Whole Earth Catalog Computer Lab:
   "1970s alternative technology workshop, homebrew computer club, Stewart Brand aesthetic, orange and brown, democratic tools, vintage catalog photography"

Classic Movie Computing Quotes

  1. Chocolate-Powered AI:
   "What would a computer do with a lifetime supply of chocolate? Willy Wonka meets mainframe, whimsical vintage computing, 1970s aesthetic, orange accents"
  2. WarGames WOPR:
   "Would you like to play a game? Cold War computing aesthetic, NORAD command center, green phosphor terminals, dramatic lighting, 1980s photography"

Decidedly Tenstorrent

  1. Tensix Mandelbrot Dreams:
   "880 RISC-V cores dreaming of fractals, purple and orange silicon wafer, crystalline structure, technical diagram meets abstract art"
  2. Orange Silicon Valley:
   "AI accelerator as California poppy field, orange blooms, Tenstorrent hardware, golden hour lighting, Stanford Foothills, technical beauty"
  3. Network-on-Chip Landscape:
   "NoC topology as ancient trade routes, silicon pathways, orange and purple, cartography meets chip design, vintage map aesthetic"
  4. The Tensor Processing Saloon:
   "Wild West saloon but it's a 1970s computer lab, orange terminals, cowboys coding RISC-V assembly, vintage Americana, film photograph"

Example Output

Here's what you can create with Stable Diffusion XL on Tenstorrent hardware:

Snowy Cabin - Generated with Stable Diffusion XL

Generated with prompt: "A cozy cabin in a snowy forest, warm lights in windows, winter evening, oil painting style"

Generation details:


Step 5: Create Your Own Interactive Demo (Advanced)

Want a simpler, more interactive experience? The pytest-based demo is powerful but complex. You can create a simplified demo script:

# ~/tt-scratchpad/simple_sdxl_demo.py
from diffusers import DiffusionPipeline
import torch

# Load the SDXL Base pipeline from Hugging Face.
# This is the standard diffusers pipeline (runs on CPU); the tt-metal demo
# swaps in the TT-NN implementation for hardware acceleration.
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float32,
    use_safetensors=True
)

# Generate an image from a prompt typed at the terminal
prompt = input("Enter your prompt: ")
image = pipeline(
    prompt=prompt,
    num_inference_steps=28,
    guidance_scale=7.5
).images[0]

# Save the result next to the script
output_path = "sdxl_output.png"
image.save(output_path)
print(f"✅ Image saved to: {output_path}")

This is a simpler starting point that you can customize further!

Step 5.5: Combined Base + Refiner (NEW in v0.65.x! 🎨)

Want even BETTER image quality? SDXL has a two-stage architecture: Base generates the image, Refiner enhances it!

The combined pipeline runs both stages automatically:

cd ~/tt-scratchpad
export PYTHONPATH=~/tt-metal:$PYTHONPATH
# Use your MESH_DEVICE from Step 2

# Run combined base + refiner pipeline
pytest ~/tt-metal/models/experimental/stable_diffusion_xl_base/demo/demo_base_and_refiner.py

What happens:

  1. Base model generates 1024x1024 image
  2. Refiner model enhances details, colors, and quality
  3. Result: Noticeably better quality than base alone!

Performance:

When to use combined pipeline:

Tip: Generate with base-only while developing your prompt, then run combined pipeline on your best results!
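If you are iterating with the simplified diffusers script from Step 5 rather than the pytest demo, the same base-then-refine idea looks roughly like the sketch below. This follows the standard diffusers base+refiner pattern; the model IDs are the public Hugging Face ones, and the 80/20 split is just a common starting point:

```python
import torch
from diffusers import DiffusionPipeline

# Stage 1: base model handles the first ~80% of the denoising steps
base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", use_safetensors=True
)

# Stage 2: refiner finishes the remaining steps, reusing the base VAE and text encoder
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    use_safetensors=True,
)

prompt = "A cozy cabin in a snowy forest, warm lights in windows, oil painting style"

# Base returns latents instead of pixels so the refiner can pick up where it left off
latents = base(
    prompt=prompt,
    num_inference_steps=40,
    denoising_end=0.8,
    output_type="latent",
).images

image = refiner(
    prompt=prompt,
    num_inference_steps=40,
    denoising_start=0.8,
    image=latents,
).images[0]
image.save("sdxl_base_plus_refiner.png")
```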

Step 6: Experiment with Code (Advanced)

Ready to go beyond button-pressing? Copy the demo to your scratchpad and modify it:

📝 Copy Demo to Scratchpad
cp ~/tt-metal/models/experimental/stable_diffusion_xl_base/demo/demo.py ~/tt-scratchpad/sdxl_demo.py

This copies demo.py to ~/tt-scratchpad/sdxl_demo.py so you can edit it freely.

What you can experiment with:

  1. Batch generation with variations:
# Generate multiple images with a different seed for each prompt
import torch

prompts = [
    "Whole Earth Catalog computer lab, 1970s",
    "Kerouac typing on vintage terminal, beat aesthetic",
    "Would you like to play a game? WOPR terminal"
]

for i, prompt in enumerate(prompts):
    image = pipe(
        prompt=prompt,
        num_inference_steps=28,
        guidance_scale=3.5,
        generator=torch.Generator().manual_seed(i)  # different seed for each image
    ).images[0]
    image.save(f"tenstorrent_{i:03d}.png")
  2. Parameter exploration:
# Try different guidance scales to see the impact on prompt adherence
for scale in [2.0, 3.5, 5.0, 7.5]:
    image = pipe(
        prompt="Tenstorrent headquarters, orange architecture",
        guidance_scale=scale
    ).images[0]
    image.save(f"guidance_{scale}.png")
  3. Prompt interpolation:
# Blend between two concepts
prompts = [
    "1960s mainframe computer room",
    "futuristic AI accelerator lab"
]
# Generate with a weighted combination (advanced: encode both prompts to
# embeddings, interpolate them, and pass the result via prompt_embeds)
  4. Custom resolution experiments:
# Try different aspect ratios (dimensions should be multiples of 8;
# SDXL works best near 1 megapixel total)
image = pipe(
    prompt="Wide cinematic shot of vintage computing",
    width=1536,  # 16:9 aspect ratio
    height=864
).images[0]
image.save("cinematic_vintage.png")

Tips for code experiments:

Make it your own! The demo is just a starting point - modify, extend, and create your own image generation workflows.

Understanding the Generation Process

Diffusion Process in SDXL:

  1. Text Encoding - Dual encoders (CLIP-L + OpenCLIP-G) process your prompt into embeddings
  2. Start with noise - Begin with random latent representation in 128x128 latent space
  3. Denoise iteratively - UNet removes noise in 28-50 steps guided by text embeddings
  4. Each step runs on TT hardware - Native TT-NN acceleration on Tensix cores
  5. VAE Decoding - Convert 128x128 latents to 1024x1024 pixel image (8x upscaling)
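
Put together, the loop looks roughly like the sketch below. This is conceptual pseudocode for illustration only (the real SDXL pipeline also passes pooled embeddings and size conditioning to the UNet); `encode_prompt`, `unet`, `scheduler`, and `vae` are stand-ins for the corresponding pipeline components, not actual API calls:

```python
import torch

# Conceptual sketch of one SDXL generation (not the exact TT-NN or diffusers API)
latents = torch.randn(1, 4, 128, 128)            # 2. start from random latent noise
cond_emb = encode_prompt(prompt)                 # 1. dual-encoder text embeddings (stand-in helper)
uncond_emb = encode_prompt("")                   #    embeddings for the empty prompt

scheduler.set_timesteps(num_inference_steps)     # 28-50 denoising steps
for t in scheduler.timesteps:
    noise_uncond = unet(latents, t, uncond_emb)  # 3./4. each UNet call runs on TT hardware
    noise_cond = unet(latents, t, cond_emb)
    # classifier-free guidance: push the prediction toward the prompt
    noise = noise_uncond + guidance_scale * (noise_cond - noise_uncond)
    latents = scheduler.step(noise, t, latents).prev_sample

image = vae.decode(latents)                      # 5. 128x128 latents -> 1024x1024 pixels
```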

Key Parameters:

num_inference_steps (28-50) - more steps generally improve quality at the cost of longer generation time

guidance_scale (7.5) - how strongly the image follows your prompt; higher values adhere more closely but can look over-processed

image_w, image_h (1024x1024) - output resolution; SDXL is tuned for roughly 1-megapixel images

seed (0) - fixes the starting noise so the same prompt and settings reproduce the same image

Prompt Engineering Tips

Good prompts include:

  1. Subject - What you want to see
  2. Style - Art style, photography type
  3. Colors - Color scheme
  4. Lighting - Lighting conditions
  5. Details - Specific details to include

Example:

"Vintage 1970s office, orange and brown color scheme, retro computers,
warm lighting, film photograph, detailed, high quality"
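
One simple way to keep prompts consistent is to assemble them from those five ingredients. A small illustrative helper (the function name and arguments are made up for this example):

```python
def build_prompt(subject, style, colors, lighting, details):
    """Assemble a prompt from subject, style, colors, lighting, and details."""
    return ", ".join([subject, style, colors, lighting, details])

prompt = build_prompt(
    subject="Vintage 1970s office with retro computers",
    style="film photograph",
    colors="orange and brown color scheme",
    lighting="warm lighting",
    details="detailed, high quality",
)
print(prompt)
```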

Keywords that work well:

Performance Optimization

For faster generation on N150:

  1. Reduce steps: --steps 30 instead of 50
  2. Lower resolution: --width 256 --height 256 instead of the full 1024x1024
  3. Use attention slicing: the script automatically enables this for N150 to reduce memory usage
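
If you are iterating with the simplified diffusers script from Step 5, the equivalent tweaks look like the sketch below (the specific values are just starting points, not tuned settings):

```python
# Faster, lower-memory settings for quick iteration (assumed values)
pipeline.enable_attention_slicing()   # trade a little speed for lower peak memory

image = pipeline(
    prompt="Vintage 1970s computer lab, orange accents",
    num_inference_steps=30,           # fewer steps than the 50-step default
    width=768, height=768,            # smaller than the native 1024x1024
).images[0]
image.save("fast_preview.png")
```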

Comparing Generation Speed

| Hardware | Steps | Time | Notes |
|---|---|---|---|
| CPU only | 50 | ~5-10 min | Very slow |
| N150 | 50 | ~15-30 sec | Accelerated |
| N300 | 50 | ~10-20 sec | Faster (2 chips) |
| High-end GPU | 50 | ~5-10 sec | For comparison |

Troubleshooting

Device reset between models (optional):

If you experience issues after running other models (like Llama from earlier lessons), you can reset the device:

tt-smi -r

This clears device state and memory. Usually not needed between pytest demos, but useful if:

Most pytest tests automatically clean up the device, so this is only needed if something went wrong.

Model download fails:

# Check Hugging Face authentication
hf auth whoami

# SDXL Base 1.0 is publicly available - no special access needed
# Visit: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0

Slow first generation: the first run includes model compilation and warmup, so it takes noticeably longer than the times listed in Step 2; subsequent generations are much faster.

Device hangs or crashes:

# Reset the device
tt-smi -r

# If that doesn't work, clear device state completely
sudo rm -rf /dev/shm/tenstorrent* /dev/shm/tt_*
tt-smi -r

What You Learned

Key takeaway: You can generate high-quality images locally on your Tenstorrent hardware, with full control over the generation process and complete privacy.

Next Steps

Experiment with:

  1. Different prompts - Try various subjects and styles
  2. Parameter tuning - Adjust steps, guidance_scale, and seed
  3. Batch generation - Create variations of successful prompts
  4. Image-to-image - Use generated images as starting points (advanced)
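
For the image-to-image idea above, the diffusers library provides a dedicated SDXL img2img pipeline. A minimal sketch, assuming you start from an image saved by the Step 5 script (the filename and prompt are just examples):

```python
from diffusers import StableDiffusionXLImg2ImgPipeline
from PIL import Image

# Starting image from an earlier text-to-image run (example filename)
init_image = Image.open("sdxl_output.png").convert("RGB")

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", use_safetensors=True
)

# strength controls how much the original image is changed (0 = keep, 1 = replace)
image = pipe(
    prompt="same cabin, but at golden-hour sunset, oil painting style",
    image=init_image,
    strength=0.5,
    num_inference_steps=40,
).images[0]
image.save("sdxl_img2img.png")
```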

Advanced topics:

Resources

Happy generating! 🎨