StableDiffusion Training Methods

Fine-tuning

  • Retrains the model's original layers with new data, modifying the base weights
  • Requires large, precisely labeled datasets
  • Model size: ~2–7 GB (same as original)
  • Status: Not practical; prohibitive dataset and effort requirements

Model Merge

  • Combines weights from multiple models using specified algorithms
  • Status: Highly desired for creating use-case-specific model variants
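
A weighted sum is the simplest such merge algorithm. A minimal sketch, with plain floats standing in for weight tensors (the function and parameter names are illustrative, not from any particular tool):

```python
def merge_weights(model_a, model_b, alpha=0.5):
    """Weighted-sum merge: out[k] = (1 - alpha) * a[k] + alpha * b[k].

    model_a and model_b are state dicts with matching keys; alpha controls
    how much of model_b ends up in the result (0.0 = pure A, 1.0 = pure B).
    """
    return {k: (1 - alpha) * model_a[k] + alpha * model_b[k] for k in model_a}

# Example: a 50/50 blend of two (toy, scalar) checkpoints.
merged = merge_weights({"w": 0.0, "b": 2.0}, {"w": 1.0, "b": 4.0}, alpha=0.5)
```

Real merge tools apply the same per-tensor arithmetic, plus variants such as add-difference merging, but the core operation is this element-wise blend.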

Textual Inversion

  • Assigns vectors to new concepts by expanding the model's vocabulary
  • Learned content is assembled from existing concept combinations rather than new features
  • Can be viewed as a formula for combining existing weights to represent new concepts
  • Model size: ~3–4 KB per vector (768 or 1024 float32 values, depending on model version)
  • Status: Best current short-term training solution
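
A toy sketch of the core idea: the model itself stays frozen, and gradient updates touch only the new token's embedding vector. Here a squared-error pull toward a fixed target stands in for the real denoising loss, so the numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=768)   # trainable vector for the new "*" token
target = rng.normal(size=768)      # stand-in for the real training signal
lr = 0.1

for _ in range(200):
    grad = 2 * (embedding - target)  # gradient of the toy squared-error loss
    embedding -= lr * grad           # only the embedding vector is updated
```

In actual textual inversion the gradient comes from the diffusion model's denoising loss backpropagated through frozen weights, but the update rule, touching a single vocabulary row, is the same shape as above.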

Aesthetic Gradient

  • Steers CLIP conditioning toward embeddings learned from reference images, a form of classifier guidance
  • Training is cheap, but the extra guidance step slows generation
  • Results in basic style transfer from learned images to generated output
  • Model size: Same as embedding
  • Status: Inconsistent results; limited practical value

Custom Diffusion

  • Fine-tunes specific model matrices using textual inversion techniques
  • Similar memory and speed to embedding training; reportedly better results in fewer steps
  • Model size: ~50 MB
  • Status: Promising but requires further investigation; limited community discussion

Hypernetwork

  • Similar to fine-tuning but adds a small neural network that dynamically modifies weights in the model's final layers
  • Acts as an adaptive layer for style steering rather than concept transfer
  • Model size: ~100–200 MB (limited to learned layers)
  • Status: Lower priority; concept transfer is more useful than style transfer
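
A minimal sketch of the adaptive-layer idea (widths and names are hypothetical): a small residual MLP is applied to activations feeding the targeted layers, zero-initialized so it starts as a no-op, while the base model stays frozen:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 320, 64                        # feature width, hidden width (toy values)
W1 = rng.normal(size=(h, d)) * 0.01   # trainable down-projection
W2 = np.zeros((d, h))                 # trainable up-projection, zero-init

def hypernet(x):
    # Residual correction to the activations; with W2 at zero this is the
    # identity, so the untrained hypernetwork leaves the model unchanged.
    return x + np.maximum(0.0, x @ W1.T) @ W2.T
```

Only W1 and W2 are trained, which is why the saved file covers just the learned layers rather than the whole checkpoint.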

Null-Text Inversion

  • Similar to textual inversion but trains an unconditional embedding for classifier-free guidance
  • Reportedly more detailed than standard text embeddings
  • Model size: Somewhat larger than a textual inversion embedding, but comparable in scale
  • Status: Promising but requires further investigation; no working prototype yet
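
For context, the unconditional embedding being trained is the one that feeds the standard classifier-free guidance combination (a sketch; the function name is illustrative):

```python
def cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.

    Null-text inversion optimizes the embedding behind eps_uncond,
    rather than the text embedding behind eps_cond.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With guidance_scale = 1.0 the result is just the conditional prediction; higher scales push the output further from the unconditional baseline, which is why the learned unconditional embedding has so much influence.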

CLIP Inversion

  • Similar to textual inversion but uses CLIP embeddings instead of text embeddings
  • Model size: Same as textual inversion
  • Status: Prohibitive; requires a specially fine-tuned model as the starting point

Dream Artist

  • Variation of textual inversion training that creates both positive and negative embeddings
  • Model size: Same as textual inversion
  • Status: Skip for now; maintenance appears insufficient

DreamBooth

  • Similar to fine-tuning but adds information on top of the model without overwriting existing concepts
  • Model size: ~2–7 GB (same as original model)
  • Status: Prohibitive; requires loading full model and generates large output

LoRA

"Low-Rank Adaptation of Large Language Models" — injects trainable low-rank matrices into the cross-attention layers.

  • Highly flexible but memory-intensive; limits training on typical GPUs
  • Multiple incompatible implementations; choose one carefully
  • Model size: ~5 MB to full model size; typical: ~150–300 MB
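
A minimal numpy sketch of the low-rank idea (dimensions and names are hypothetical): the frozen weight W gains a trainable correction B @ A of rank r, and zero-initializing B makes the injection a no-op at the start of training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 320, 4                        # layer width, LoRA rank (toy values)
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def forward(x, scale=1.0):
    # Equivalent to x @ (W + scale * B @ A).T, but computed without ever
    # materializing the full-rank correction; only A and B are trained.
    return x @ W.T + scale * (x @ A.T) @ B.T
```

Storing only A and B (2 * d * r values per adapted layer instead of d * d) is what keeps LoRA files small relative to a full checkpoint, and the rank r is the main knob behind the wide range of file sizes seen in practice.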