Stable Diffusion Pipeline

For comprehensive explanations, see:

  - How Stable Diffusion Works (end-to-end overview)
  - Understanding Diffusion Probabilistic Models (technical deep dive)

Pipeline Steps

  1. Encoder / Conditioning — Convert text (tokenized, then passed through a text encoder) or images (via a vision model) into embeddings that condition generation (e.g., CLIP text encoder)

  2. Sampler — Start from random latent noise and control how much noise is removed at each step (e.g., K-LMS)

  3. Diffuser — Produce latent image content from the noise, guided by the conditioning embeddings (e.g., Stable Diffusion checkpoint)

  4. Autoencoder — Map between latent and pixel space to generate actual images (e.g., VAE)

  5. Denoising — Predict the noise present in the current latents, using the conditioning, so the sampler can remove it (e.g., U-Net)

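  6. Iterative Refinement — Repeat the diffuser and denoising steps for a fixed number of sampler iterations, with cross-attention injecting the conditioning each time, to progressively improve the result; in Stable Diffusion the autoencoder decodes to pixels only once, after the final iteration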

  7. Optional Post-Processing — Apply additional models as needed (e.g., ESRGAN for upscaling)
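
The control flow of the steps above can be sketched in a few lines of Python. This is a toy illustration only, not a real model: every function name here (`encode_prompt`, `initial_latents`, `predict_noise`, `sampler_step`, `decode`, `generate`) is hypothetical, the "latents" are short lists of floats, and the noise predictor is a stand-in that simply treats the gap between the latents and the conditioning vector as noise. It assumes the usual latent-diffusion arrangement, in which the loop runs in latent space and decoding happens once at the end.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def encode_prompt(prompt: str, dim: int = 8) -> Vector:
    """Stand-in for a text encoder such as CLIP: prompt -> fixed-size vector."""
    rng = random.Random(prompt)  # seeding with the prompt keeps this deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def initial_latents(dim: int, seed: int) -> Vector:
    """Starting point of the sampler: pure Gaussian noise in latent space."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def predict_noise(latents: Vector, cond: Vector) -> Vector:
    """Stand-in for the U-Net: estimate the noise given the conditioning.
    Toy assumption: noise is whatever separates the latents from `cond`."""
    return [z - c for z, c in zip(latents, cond)]


def sampler_step(latents: Vector, noise: Vector, step_size: float) -> Vector:
    """Stand-in for one sampler update: remove a fraction of the predicted noise."""
    return [z - step_size * n for z, n in zip(latents, noise)]


def denoise(prompt: str, steps: int = 20, seed: int = 0, dim: int = 8) -> Vector:
    """Run the iterative refinement loop entirely in latent space."""
    cond = encode_prompt(prompt, dim)
    latents = initial_latents(dim, seed)
    for _ in range(steps):
        noise = predict_noise(latents, cond)
        latents = sampler_step(latents, noise, step_size=0.3)
    return latents


def decode(latents: Vector) -> List[int]:
    """Stand-in for the VAE decoder: map latents to pixel values in 0..255."""
    return [max(0, min(255, round((z + 1.0) * 127.5))) for z in latents]


def generate(
    prompt: str,
    steps: int = 20,
    seed: int = 0,
    dim: int = 8,
    postprocess: Optional[Callable[[List[int]], List[int]]] = None,
) -> List[int]:
    """Encode, denoise, decode once, then optionally post-process."""
    pixels = decode(denoise(prompt, steps, seed, dim))
    return postprocess(pixels) if postprocess else pixels


if __name__ == "__main__":
    print(generate("a red fox", seed=1))
```

In this sketch the same prompt and seed always give the same output, which mirrors how a fixed seed makes a real pipeline reproducible. The optional `postprocess` callable marks where an upscaler such as ESRGAN would sit.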