Stable Diffusion Training Methods
Fine-tuning
- Retrains the model's own layers with new data, modifying the original weights
- Requires large, precisely labeled datasets
- Model size: ~2–7 GB (same as original)
- Status: Not practical; prohibitive dataset and effort requirements
Model Merge
- Combines weights from multiple models using specified algorithms
- Status: Highly desired for creating use-case-specific model variants
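The merge arithmetic can be sketched with plain Python dicts standing in for checkpoint state dicts (a hypothetical helper; real merge tools do the same interpolation on tensors):

```python
def merge_models(model_a, model_b, alpha=0.5):
    """Weighted interpolation: result = (1 - alpha) * a + alpha * b."""
    merged = {}
    for key, weight_a in model_a.items():
        if key in model_b:
            merged[key] = (1.0 - alpha) * weight_a + alpha * model_b[key]
        else:
            # Keys present in only one model are carried over unchanged.
            merged[key] = weight_a
    return merged

# alpha = 0 keeps model_a; alpha = 1 replaces shared keys with model_b.
result = merge_models({"attn.w": 0.0, "extra": 2.0}, {"attn.w": 1.0}, alpha=0.5)
```

More elaborate algorithms (e.g. add-difference merges) vary the per-key formula, but the structure is the same key-by-key walk.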
Textual Inversion
- Assigns vectors to new concepts by expanding the model's vocabulary
- Learned content is assembled from existing concept combinations rather than new features
- Can be viewed as a formula for combining existing weights to represent new concepts
- Model size: one vector of 768 (SD 1.x) or 1024 (SD 2.x) values per learned token; a few KB on disk
- Status: Best current short-term training solution
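The "formula over existing concepts" idea can be illustrated with a toy weighted sum, using short plain-Python lists in place of the real 768/1024-value vectors (names and numbers here are purely illustrative):

```python
def compose_embedding(concept_vectors, coefficients):
    """Build a new concept vector as a weighted sum of existing ones."""
    dim = len(next(iter(concept_vectors.values())))
    new_vec = [0.0] * dim
    for name, coeff in coefficients.items():
        for i, component in enumerate(concept_vectors[name]):
            new_vec[i] += coeff * component
    return new_vec

# Toy 2-value "vocabulary"; real embeddings have 768 or 1024 components.
vocab = {"cat": [1.0, 0.0], "painting": [0.0, 1.0]}
# A hypothetical learned token that lands near "painting" with a bit of "cat".
my_style = compose_embedding(vocab, {"painting": 0.7, "cat": 0.3})
```

In actual training the coefficients are not chosen by hand; gradient descent finds a vector in the existing embedding space that reproduces the training images.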
Aesthetic Gradient
- Uses learned image embeddings to steer generation through CLIP-based classifier guidance
- Training is cheap, but classifier guidance slows generation
- Results in basic style transfer from learned images to generated output
- Model size: Same as embedding
- Status: Inconsistent results; limited practical value
Custom Diffusion
- Fine-tunes specific model matrices using textual inversion techniques
- Similar memory and speed to embedding training; reportedly better results in fewer steps
- Model size: ~50 MB
- Status: Promising but requires further investigation; limited community discussion
Hypernetwork
- Similar to fine-tuning but adds a small neural network that dynamically modifies weights in the model's final layers
- Acts as an adaptive layer for style steering rather than concept transfer
- Model size: ~100–200 MB (stores only the added network for the layers it modifies, not the full model)
- Status: Lower priority; concept transfer is more useful than style transfer
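A minimal sketch of the hypernetwork idea, assuming a tiny two-layer net whose output is added to an activation while the base model's weights stay frozen (shapes and names are illustrative, not the actual implementation):

```python
def hypernetwork_adjust(activation, w1, w2):
    """Pass an activation through a small ReLU network and ADD the
    result back, leaving the original model weights untouched."""
    hidden = [max(0.0, sum(a * w for a, w in zip(activation, row)))  # ReLU
              for row in w1]
    delta = [sum(h * w for h, w in zip(hidden, row)) for row in w2]
    return [a + d for a, d in zip(activation, delta)]
```

Because only `w1` and `w2` are trained, the saved file covers just the small added network, which is why the output stays in the ~100–200 MB range rather than the full model size.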
Null-Text Inversion
- Similar to textual inversion but trains an unconditional embedding for classifier-free guidance
- Reportedly more detailed than standard text embeddings
- Model size: somewhat larger than a textual-inversion embedding, but of the same order
- Status: Promising but requires further investigation; no working prototype yet
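For context, the classifier-free guidance step whose unconditional side null-text inversion optimizes looks like this (a toy version on plain lists; real code does the same elementwise operation on noise-prediction tensors):

```python
def cfg_combine(noise_uncond, noise_cond, scale):
    """Classifier-free guidance: push the conditional prediction away
    from the unconditional one.  noise = uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]
```

Null-text inversion trains the embedding behind `noise_uncond` instead of the text-conditioned one, which is why it can capture detail that an ordinary prompt embedding misses.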
CLIP Inversion
- Similar to textual inversion but uses CLIP embeddings instead of text embeddings
- Model size: Same as textual inversion
- Status: Prohibitive; requires a specially fine-tuned model as the starting point
Dream Artist
- Variation of textual inversion training that creates both positive and negative embeddings
- Model size: Same as textual inversion
- Status: Skip for now; maintenance appears insufficient
DreamBooth
- Similar to fine-tuning, but trains the model on a new subject while preserving existing concepts instead of overwriting them
- Model size: ~2–7 GB (same as original model)
- Status: Prohibitive; requires loading full model and generates large output
LoRA
- "Low-Rank Adaptation of Large Language Models": injects trainable low-rank layers into the cross-attention layers
- Highly flexible, but training is memory-intensive, which limits it on typical consumer GPUs
- Multiple incompatible implementations; choose one carefully
- Model size: ~5 MB to full model size; typical: ~150–300 MB
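The low-rank update itself can be sketched in plain Python: `A` and `B` are the small trained matrices, and only `rank * (out + in)` extra values need to be stored (a hypothetical sketch under those assumptions, not any specific implementation):

```python
def matmul(a, b):
    """Plain-Python matrix multiply, just for the sketch."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(W, A, B, scale=1.0):
    """Effective weight W' = W + scale * (A @ B).

    W is the frozen (out x in) base weight; A is (out x r) and B is
    (r x in) with rank r << min(out, in), so the trained delta is tiny.
    """
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Rank-1 update to a 2x2 weight: only 4 extra values trained instead of 4 full ones;
# the savings grow quadratically with layer width.
W_prime = lora_weight([[1.0, 0.0], [0.0, 1.0]], [[1.0], [1.0]], [[0.5, 0.5]])
```

The wide size range in the bullet above follows directly from the choice of rank and of which layers receive the injected matrices.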