Stable Diffusion Pipeline
For comprehensive explanations, see:
- How Stable Diffusion Works (end-to-end overview)
- Understanding Diffusion Probabilistic Models (technical deep dive)
Pipeline Steps
1. Encoder / Conditioning: Convert text (via a tokenizer) or images (via a vision model) into embeddings that condition generation (e.g., CLIP text encoder)
2. Sampler: Generate the starting noise and control how it is removed at each step (e.g., K-LMS)
3. Diffuser: Produce latent content from the noise and the conditioning embeddings (e.g., a Stable Diffusion checkpoint)
4. Autoencoder: Map between latent space and pixel space to produce actual images (e.g., VAE)
5. Denoising: Refine the output by predicting and removing noise, guided by the conditioning (e.g., U-Net, which in Stable Diffusion operates on latents rather than pixels)
6. Iterative Refinement: Repeat steps 3–5 with cross-attention to progressively blend and improve results
7. Optional Post-Processing: Apply additional models as needed (e.g., ESRGAN for upscaling)
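
The control flow of the steps above can be sketched in a few lines of plain Python. This is a toy illustration, not how any real Stable Diffusion implementation is written. Every function name here (`encode_prompt`, `make_sigmas`, `initial_noise`, `predict_noise`, `euler_step`, `decode`, `upscale_2x`, `generate`) is hypothetical, and each model is replaced by a trivial stand-in so the sketch runs without weights. Two assumptions matter most: the noise predictor is an exact oracle that treats the conditioning vector as the clean latent, and the sampler uses a plain Euler update rather than K-LMS.

```python
import hashlib
import random


def encode_prompt(prompt, dim=8):
    """Encoder / Conditioning stand-in: text to a fixed-length vector.

    A real encoder (e.g., CLIP) is a learned network. Hashing is only a
    deterministic placeholder so the sketch runs without model weights.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    # Scale each byte (0..255) into the range [-1, 1].
    return [b / 127.5 - 1.0 for b in digest[:dim]]


def make_sigmas(steps, sigma_max=10.0):
    """Sampler, part 1: noise levels falling linearly from sigma_max to 0."""
    return [sigma_max * (1.0 - i / steps) for i in range(steps + 1)]


def initial_noise(dim, sigma, seed):
    """Sampler, part 2: seeded Gaussian noise scaled to the first level."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) * sigma for _ in range(dim)]


def predict_noise(latent, sigma, cond):
    """Denoising stand-in for the U-Net: estimate the noise in `latent`.

    Assumption: the clean latent equals the conditioning vector, so this
    estimate is exact. A trained U-Net has to learn it instead.
    """
    return [(x - c) / sigma for x, c in zip(latent, cond)]


def euler_step(latent, eps, sigma, sigma_next):
    """Sampler, part 3: move the latent from level sigma to sigma_next."""
    return [x + (sigma_next - sigma) * e for x, e in zip(latent, eps)]


def decode(latent):
    """Autoencoder stand-in: map latents in [-1, 1] to pixel values 0..255."""
    return [min(255, max(0, round((x + 1.0) * 127.5))) for x in latent]


def upscale_2x(pixels):
    """Post-processing stand-in: nearest-neighbour 2x upscale of a 1-D image."""
    return [p for p in pixels for _ in range(2)]


def generate(prompt, steps=20, seed=0):
    cond = encode_prompt(prompt)
    sigmas = make_sigmas(steps)
    latent = initial_noise(len(cond), sigmas[0], seed)
    # Iterative refinement: remove part of the noise at every step.
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        eps = predict_noise(latent, sigma, cond)
        latent = euler_step(latent, eps, sigma, sigma_next)
    # Decode once, after the loop, from latent space to pixel space.
    return cond, latent, decode(latent)
```

Calling `generate("a photo of a cat", steps=20, seed=42)` returns the conditioning vector, the final latent, and eight pixel values; passing the pixels to `upscale_2x` doubles their count. Because the noise predictor here is exact, the final latent lands on the conditioning vector whatever the seed. A real model only approximates the noise, which is why step count, sampler choice, and seed all affect the result.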