StableDiffusion Training Methods

Fine-tuning

  • Retrains the model's original layers with new data, modifying the base weights
  • Requires large, precisely labeled datasets
  • Model size: ~2–7 GB (same as original)
  • Status: Not practical; prohibitive dataset and effort requirements

Model Merge

  • Combines weights from multiple models using specified algorithms
  • Status: Highly desired for creating use-case-specific model variants
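
A weighted sum is the simplest such merge algorithm. A minimal sketch, with plain floats standing in for weight tensors (the function and parameter names are illustrative, not from any particular tool):

```python
def merge_weights(model_a, model_b, alpha=0.5):
    """Weighted-sum merge: out[k] = (1 - alpha) * a[k] + alpha * b[k].

    model_a and model_b are state dicts with matching keys; alpha controls
    how much of model_b ends up in the result (0.0 = pure A, 1.0 = pure B).
    """
    return {k: (1 - alpha) * model_a[k] + alpha * model_b[k] for k in model_a}

# Example: a 50/50 blend of two (toy, scalar) checkpoints.
merged = merge_weights({"w": 0.0, "b": 2.0}, {"w": 1.0, "b": 4.0}, alpha=0.5)
```

Real merge tools apply the same per-tensor arithmetic, plus variants such as add-difference merging, but the core operation is this element-wise blend.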

Textual Inversion

  • Assigns vectors to new concepts by expanding the model's vocabulary
  • Learned content is assembled from existing concept combinations rather than new features
  • Can be viewed as a formula for combining existing weights to represent new concepts
  • Model size: ~3–4 KB per vector (768 or 1024 float32 values, depending on model version)
  • Status: Best current short-term training solution
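
A toy sketch of the core idea: the model itself stays frozen, and gradient updates touch only the new token's embedding vector. Here a squared-error pull toward a fixed target stands in for the real denoising loss, so the numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.normal(size=768)   # trainable vector for the new "*" token
target = rng.normal(size=768)      # stand-in for the real training signal
lr = 0.1

for _ in range(200):
    grad = 2 * (embedding - target)  # gradient of the toy squared-error loss
    embedding -= lr * grad           # only the embedding vector is updated
```

In actual textual inversion the gradient comes from the diffusion model's denoising loss backpropagated through frozen weights, but the update rule, touching a single vocabulary row, is the same shape as above.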

Aesthetic Gradient

  • Steers CLIP conditioning toward embeddings learned from reference images, a form of classifier guidance
  • Training is cheap, but the extra guidance step slows generation
  • Results in basic style transfer from learned images to generated output
  • Model size: Same as embedding
  • Status: Inconsistent results; limited practical value

Custom Diffusion

  • Fine-tunes specific model matrices using textual inversion techniques
  • Similar memory and speed to embedding training; reportedly better results in fewer steps
  • Model size: ~50 MB
  • Status: Promising but requires further investigation; limited community discussion

Hypernetwork

  • Similar to fine-tuning but adds a small neural network that dynamically modifies weights in the model's final layers
  • Acts as an adaptive layer for style steering rather than concept transfer
  • Model size: ~100–200 MB (limited to learned layers)
  • Status: Lower priority; concept transfer is more useful than style transfer
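
A minimal sketch of the adaptive-layer idea (widths and names are hypothetical): a small residual MLP is applied to activations feeding the targeted layers, zero-initialized so it starts as a no-op, while the base model stays frozen:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 320, 64                        # feature width, hidden width (toy values)
W1 = rng.normal(size=(h, d)) * 0.01   # trainable down-projection
W2 = np.zeros((d, h))                 # trainable up-projection, zero-init

def hypernet(x):
    # Residual correction to the activations; with W2 at zero this is the
    # identity, so the untrained hypernetwork leaves the model unchanged.
    return x + np.maximum(0.0, x @ W1.T) @ W2.T
```

Only W1 and W2 are trained, which is why the saved file covers just the learned layers rather than the whole checkpoint.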

Null-Text Inversion

  • Similar to textual inversion but trains an unconditional embedding for classifier-free guidance
  • Reportedly more detailed than standard text embeddings
  • Model size: Somewhat larger than a textual inversion embedding, but comparable in scale
  • Status: Promising but requires further investigation; no working prototype yet
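
For context, the unconditional embedding being trained is the one that feeds the standard classifier-free guidance combination (a sketch; the function name is illustrative):

```python
def cfg(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one.

    Null-text inversion optimizes the embedding behind eps_uncond,
    rather than the text embedding behind eps_cond.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With guidance_scale = 1.0 the result is just the conditional prediction; higher scales push the output further from the unconditional baseline, which is why the learned unconditional embedding has so much influence.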

CLIP Inversion

  • Similar to textual inversion but uses CLIP embeddings instead of text embeddings
  • Model size: Same as textual inversion
  • Status: Prohibitive; requires a specially fine-tuned model as the starting point

Dream Artist

  • Variation of textual inversion training that creates both positive and negative embeddings
  • Model size: Same as textual inversion
  • Status: Skip for now; maintenance appears insufficient

DreamBooth

  • Similar to fine-tuning but adds information on top of the model without overwriting existing concepts
  • Model size: ~2–7 GB (same as original model)
  • Status: Prohibitive; requires loading full model and generates large output

LoRA

"Low-Rank Adaptation of Large Language Models" — injects trainable low-rank matrices into the cross-attention layers.

  • Highly flexible but memory-intensive; limits training on typical GPUs
  • Multiple incompatible implementations; choose one carefully
  • Model size: ~5 MB to full model size; typical: ~150–300 MB
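
A minimal numpy sketch of the low-rank idea (dimensions and names are hypothetical): the frozen weight W gains a trainable correction B @ A of rank r, and zero-initializing B makes the injection a no-op at the start of training:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 320, 4                        # layer width, LoRA rank (toy values)
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def forward(x, scale=1.0):
    # Equivalent to x @ (W + scale * B @ A).T, but computed without ever
    # materializing the full-rank correction; only A and B are trained.
    return x @ W.T + scale * (x @ A.T) @ B.T
```

Storing only A and B (2 * d * r values per adapted layer instead of d * d) is what keeps LoRA files small relative to a full checkpoint, and the rank r is the main knob behind the wide range of file sizes seen in practice.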