Offload

Offload moves a model, or model parts, between GPU memory (VRAM) and system memory (RAM). This reduces VRAM usage and helps run larger models on lower-VRAM GPUs.

Offload Mode

Tip

Set the offload mode in Settings -> Models & Loading -> Model offload mode.

Balanced

Balanced offload works differently from the other offload modes: it offloads only when VRAM usage exceeds a user-specified threshold.

  • Recommended for compatible high-VRAM GPUs.
  • Faster than other offload modes, but requires platform compatibility and enough VRAM.
  • Moves model parts based on user-defined thresholds, so you can control VRAM usage.
  • The high threshold sets the maximum memory usage allowed for weights of a single model component.
  • The low threshold controls when unused model parts are offloaded back to RAM. If VRAM usage is above the low threshold, offloading runs. Otherwise, it does nothing.
  • Configure thresholds in Settings -> Models & Loading -> Balanced offload GPU high / low watermark.
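
The two thresholds described above can be sketched as a pair of checks. This is a minimal illustration, not the actual implementation: the `Watermarks` class and both function names are hypothetical, and mapping the low watermark to `gpu-min` and the high watermark to `gpu-max` is an assumption based on the default values listed on this page.

```python
from dataclasses import dataclass


@dataclass
class Watermarks:
    """Thresholds as fractions of total VRAM (hypothetical container)."""
    low: float = 0.2   # assumed to correspond to gpu-min
    high: float = 0.6  # assumed to correspond to gpu-max


def should_offload_unused(used_gb: float, total_gb: float, marks: Watermarks) -> bool:
    """Unused parts are moved back to RAM only when usage is above the low watermark."""
    return used_gb / total_gb > marks.low


def fits_on_gpu(component_gb: float, total_gb: float, marks: Watermarks) -> bool:
    """Weights of a single component may occupy at most the high-watermark share of VRAM."""
    return component_gb <= marks.high * total_gb


if __name__ == "__main__":
    marks = Watermarks()
    # 16 GB card with 4 GB in use: 25% is above the 20% low watermark.
    print(should_offload_unused(4.0, 16.0, marks))
    # A 10 GB component exceeds 60% of 16 GB.
    print(fits_on_gpu(10.0, 16.0, marks))
```

Under these assumptions, lowering the low watermark makes offloading run more often, while raising the high watermark lets larger components stay resident in VRAM.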

Balanced offload default behavior is based on detected GPU memory:

  • default: offload=balanced gpu-min=0.2 gpu-max=0.6 gc-threshold=0.7
  • <= 4gb/lowvram: offload=sequential quantization=cpu vae-tiling=on gc-threshold=0.0
  • <= 12gb/medvram: offload=balanced gpu-min=0.0 vae-tiling=on
  • >= 24gb/highvram: offload=balanced gpu-max=0.8 never=clip-l,clip-g,vae
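
The defaults above can be read as a lookup keyed on detected VRAM. The sketch below is illustrative only: the function name `default_offload_settings` is hypothetical, and it assumes that each memory profile overrides the base defaults while unlisted keys keep their default values, and that GPUs between 12 GB and 24 GB use the base defaults unchanged.

```python
def default_offload_settings(vram_gb: float) -> dict:
    """Return the default settings for a detected VRAM size (illustrative only)."""
    # Base defaults, taken from the "default" entry.
    settings = {
        "offload": "balanced",
        "gpu-min": 0.2,
        "gpu-max": 0.6,
        "gc-threshold": 0.7,
    }
    if vram_gb <= 4:  # lowvram
        settings.update({
            "offload": "sequential",
            "quantization": "cpu",
            "vae-tiling": "on",
            "gc-threshold": 0.0,
        })
    elif vram_gb <= 12:  # medvram
        settings.update({"gpu-min": 0.0, "vae-tiling": "on"})
    elif vram_gb >= 24:  # highvram
        settings.update({"gpu-max": 0.8, "never": ["clip-l", "clip-g", "vae"]})
    return settings


if __name__ == "__main__":
    for size in (4, 8, 16, 24):
        print(size, default_offload_settings(size))
```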

Warning

Not compatible with Optimum.Quanto qint quantization

Sequential

Works layer by layer for each model component marked as offload-compatible.

  • Recommended for low-VRAM GPUs.
  • Much slower, but can run large models such as FLUX on GPUs with 2-4GB VRAM.

Warning

Not compatible with Quanto qint or BitsAndBytes nf4 quantization

Note

Using --lowvram automatically enables sequential offload.

Model

Works at the model-component level by offloading components marked as offload-compatible. Examples include VAE and text encoder.

  • Recommended for medium VRAM when balanced offload is not compatible.
  • More compatible than balanced or sequential, but with lower memory savings.
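
The difference in granularity between model and sequential offload can be illustrated with a toy calculation. This is a rough sketch under stated assumptions: the component names and sizes are made-up numbers, only weights are counted (activations are ignored), and exactly one unit (a whole component or a single layer) is assumed to be resident in VRAM at a time.

```python
# Hypothetical per-layer weight sizes in GB, for illustration only (not measured).
COMPONENTS = {
    "text_encoder": [0.5, 0.5],
    "unet": [1.0, 2.0, 1.5],
    "vae": [0.2],
}


def peak_model_offload(parts: dict) -> float:
    """Model offload: a whole component is resident, so the peak is the largest component."""
    return max(sum(layers) for layers in parts.values())


def peak_sequential_offload(parts: dict) -> float:
    """Sequential offload: one layer is resident, so the peak is the largest layer."""
    return max(max(layers) for layers in parts.values())


def transfer_count(parts: dict, per_layer: bool) -> int:
    """Number of RAM -> VRAM moves per pass: one per layer or one per component."""
    if per_layer:
        return sum(len(layers) for layers in parts.values())
    return len(parts)


if __name__ == "__main__":
    print("model peak GB:", peak_model_offload(COMPONENTS))
    print("sequential peak GB:", peak_sequential_offload(COMPONENTS))
    print("model transfers:", transfer_count(COMPONENTS, per_layer=False))
    print("sequential transfers:", transfer_count(COMPONENTS, per_layer=True))
```

With these toy numbers, sequential offload needs less VRAM at peak but performs more transfers per pass, which is consistent with the trade-off described above: lower memory use at the cost of speed.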

Limitations: N/A

Performance Notes

  • Tested using SDXL with 2 large LoRA models
  • Sequential offload is default for GPUs with 4GB or less
  • Balanced offload is default for GPUs with more than 4GB. Balanced offload is slower than no offload, but it enables large models such as SD35 and FLUX.1 out of the box.
  • Balanced offload set to default values
  • LoRA overhead is measured in seconds for the first and subsequent iterations
  • LoRA mode=backup can use up to 2x system memory. On large models such as SD35 or FLUX.1, this can be prohibitive.

| Offload mode | LoRA type | LoRA mode | LoRA overhead (s, first / subsequent) | End-to-end it/s | Note |
|---|---|---|---|---|---|
| none | none | N/A | N/A | 6.7 | fastest inference |
| balanced | none | N/A | N/A | 4.5 | default without LoRA |
| sequential | none | N/A | N/A | 0.6 | lowvram |
| none | native | backup | 1.8 / 0.0 | 6.0 | |
| balanced | native | backup | 1.3 / 0.0 | 2.8 | |
| sequential | native | backup | 5.8 / 0.0 | 0.5 | |
| none | native | fuse | 1.3 / 1.3 | 4.8 | |
| balanced | native | fuse | 2.8 / 2.5 | 3.1 | default with LoRA |
| sequential | native | fuse | 8.8 / 7.7 | 0.4 | |
| none | diffusers | default | 2.9 / 2.9 | 3.8 | |
| balanced | diffusers | default | 2.2 / 2.2 | 2.1 | |
| sequential | diffusers | default | 4.6 / 4.6 | 0.3 | |
| none | diffusers | fuse | 5.7 / 5.7 | 2.0 | |
| balanced | diffusers | fuse | N/A | | did not complete |
| sequential | diffusers | fuse | N/A | | did not complete |
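
To compare offload modes, the end-to-end it/s figures can be expressed relative to the no-offload baseline. The short sketch below uses only the three no-LoRA rows from the table above. The helper name `relative_throughput` is hypothetical, and the percentages apply only to this specific test setup (SDXL, default balanced offload values).

```python
# End-to-end it/s without LoRA, copied from the table above.
IT_PER_SEC = {
    "none": 6.7,
    "balanced": 4.5,
    "sequential": 0.6,
}


def relative_throughput(mode: str, baseline: str = "none") -> float:
    """Throughput of an offload mode as a fraction of the baseline mode."""
    return IT_PER_SEC[mode] / IT_PER_SEC[baseline]


if __name__ == "__main__":
    for mode in ("balanced", "sequential"):
        share = relative_throughput(mode)
        print(f"{mode}: {share:.0%} of no-offload throughput")
```

In this benchmark, balanced offload retains roughly two thirds of the no-offload throughput, while sequential offload retains less than a tenth.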