Offload
Offload moves a model, or model parts, between GPU memory (VRAM) and system memory (RAM). This reduces VRAM usage and helps run larger models on lower-VRAM GPUs.
Offload Mode
Tip
Set the offload mode in Settings -> Models & Loading -> Model offload mode.
Balanced
Balanced offload differs from the other offload modes: it offloads only when VRAM usage exceeds a user-specified threshold.
- Recommended for compatible high-VRAM GPUs.
- Faster than other offload modes, but requires platform compatibility and enough VRAM.
- Moves model parts based on user-defined thresholds, so you can control VRAM usage.
- The high threshold sets the maximum memory usage allowed for weights of a single model component.
- The low threshold controls when unused model parts are offloaded back to RAM. If VRAM usage is above the low threshold, offloading runs. Otherwise, it does nothing.
- Configure thresholds in Settings -> Models & Loading -> Balanced offload GPU high / low watermark.
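The two thresholds can be read as a pair of simple checks. The sketch below is only an illustration, not the actual implementation: it assumes both watermarks are fractions of total VRAM and uses the default `gpu-min=0.2` / `gpu-max=0.6` values as the low / high watermark. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Watermarks:
    # Assumption: both values are fractions of total VRAM.
    low: float = 0.2   # assumed to correspond to gpu-min
    high: float = 0.6  # assumed to correspond to gpu-max


def should_offload_unused(used_gb: float, total_gb: float, marks: Watermarks) -> bool:
    """Unused parts are offloaded only while usage is above the low watermark."""
    return used_gb / total_gb > marks.low


def component_within_limit(weights_gb: float, total_gb: float, marks: Watermarks) -> bool:
    """The weights of a single component may occupy at most the high watermark."""
    return weights_gb <= marks.high * total_gb


marks = Watermarks()
print(should_offload_unused(used_gb=1.0, total_gb=8.0, marks=marks))     # usage below low watermark
print(should_offload_unused(used_gb=4.0, total_gb=8.0, marks=marks))     # usage above low watermark
print(component_within_limit(weights_gb=6.0, total_gb=8.0, marks=marks)) # component exceeds high watermark
```

Lowering the low watermark makes offloading run more often, which trades speed for free VRAM.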
Default offload settings depend on detected GPU memory:
- default: offload=balanced gpu-min=0.2 gpu-max=0.6 gc-threshold=0.7
- <= 4gb/lowvram: offload=sequential quantization=cpu vae-tiling=on gc-threshold=0.0
- <= 12gb/medvram: offload=balanced gpu-min=0.0 vae-tiling=on
- >= 24gb/highvram: offload=balanced gpu-max=0.8 never=clip-l,clip-g,vae
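The tiers above amount to a lookup on detected VRAM size. A minimal sketch, assuming the size is already known in GB and that sizes between 12 GB and 24 GB fall through to the default row; the function name and dictionary layout are hypothetical, and only the listed values come from the rows above.

```python
# Settings rows copied from the list above; the dict layout is illustrative.
DEFAULT = {"offload": "balanced", "gpu-min": 0.2, "gpu-max": 0.6, "gc-threshold": 0.7}
LOWVRAM = {"offload": "sequential", "quantization": "cpu", "vae-tiling": "on", "gc-threshold": 0.0}
MEDVRAM = {"offload": "balanced", "gpu-min": 0.0, "vae-tiling": "on"}
HIGHVRAM = {"offload": "balanced", "gpu-max": 0.8, "never": "clip-l,clip-g,vae"}


def profile_for(vram_gb: float) -> dict:
    """Return the settings row that matches the detected VRAM size."""
    if vram_gb <= 4:
        return LOWVRAM
    if vram_gb <= 12:
        return MEDVRAM
    if vram_gb >= 24:
        return HIGHVRAM
    return DEFAULT  # assumption: no tier applies between 12 GB and 24 GB


for size in (4, 8, 16, 24):
    print(size, profile_for(size))
```

Whether a tier row replaces the default row or is layered on top of it is not stated here, so the sketch returns each row as listed.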
Warning
Not compatible with Optimum.Quanto qint quantization
Sequential
Works layer by layer for each model component marked as offload-compatible.
- Recommended for low-VRAM GPUs.
- Much slower, but can run large models such as FLUX on GPUs with 2-4GB VRAM.
Warning
Not compatible with Optimum.Quanto qint or BitsAndBytes nf4 quantization
Note
Using --lowvram automatically enables sequential offload.
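The compatibility warnings for each mode can be collected into a small lookup. This is a sketch only: the quantization labels (`quanto-qint`, `bnb-nf4`) and the function name are hypothetical identifiers chosen for illustration, and the `model` entry is empty because this page lists no limitations for it.

```python
# Offload mode -> quantization types this page warns against.
INCOMPATIBLE = {
    "balanced": {"quanto-qint"},
    "sequential": {"quanto-qint", "bnb-nf4"},
    "model": set(),  # no limitations listed on this page
}


def is_compatible(offload_mode: str, quantization: str) -> bool:
    """Return True unless the page warns against this combination."""
    return quantization not in INCOMPATIBLE[offload_mode]


print(is_compatible("balanced", "bnb-nf4"))
print(is_compatible("sequential", "bnb-nf4"))
```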
Model
Works at the model-component level by offloading components marked as offload-compatible. Examples include VAE and text encoder.
- Recommended for medium VRAM when balanced offload is not compatible.
- More compatible than balanced or sequential, but with lower memory savings.
Limitations: N/A
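The difference in memory savings between model and sequential offload comes from granularity. The toy calculation below uses made-up layer sizes and counts weights only (no activations or overhead), so the numbers are illustrative rather than measured; component names and sizes are assumptions.

```python
# Toy model: component -> sizes of its layers in MB (made-up numbers).
COMPONENTS = {
    "text_encoder": [500, 500],
    "unet": [1000, 2000, 1000],
    "vae": [200],
}


def peak_no_offload(components: dict) -> int:
    """Everything stays in VRAM at once."""
    return sum(sum(layers) for layers in components.values())


def peak_model_offload(components: dict) -> int:
    """One whole component is resident at a time."""
    return max(sum(layers) for layers in components.values())


def peak_sequential_offload(components: dict) -> int:
    """One layer is resident at a time."""
    return max(max(layers) for layers in components.values())


print(peak_no_offload(COMPONENTS))          # 5200
print(peak_model_offload(COMPONENTS))       # 4000
print(peak_sequential_offload(COMPONENTS))  # 2000
```

Finer granularity lowers the peak but means more transfers between RAM and VRAM, which is consistent with sequential offload being the slowest mode.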
Performance Notes
- Tested using SDXL with 2 large LoRA models.
- Sequential offload is the default for GPUs with 4GB of VRAM or less.
- Balanced offload is the default for GPUs with more than 4GB of VRAM. Balanced offload is slower than no offload, but it enables large models such as SD35 and FLUX.1 out of the box.
- Balanced offload was tested with its default values.
- LoRA overhead is measured in seconds and shown as first / subsequent iterations.
- LoRA mode=backup can use up to 2x system memory. On large models such as SD35 or FLUX.1, this can be prohibitive.
| Offload mode | LoRA type | LoRA mode | LoRA overhead (s, first / subsequent) | End-to-end it/s | Note |
|---|---|---|---|---|---|
| none | none | N/A | N/A | 6.7 | fastest inference |
| balanced | none | N/A | N/A | 4.5 | default without LoRA |
| sequential | none | N/A | N/A | 0.6 | lowvram |
| none | native | backup | 1.8 / 0.0 | 6.0 | |
| balanced | native | backup | 1.3 / 0.0 | 2.8 | |
| sequential | native | backup | 5.8 / 0.0 | 0.5 | |
| none | native | fuse | 1.3 / 1.3 | 4.8 | |
| balanced | native | fuse | 2.8 / 2.5 | 3.1 | default with LoRA |
| sequential | native | fuse | 8.8 / 7.7 | 0.4 | |
| none | diffusers | default | 2.9 / 2.9 | 3.8 | |
| balanced | diffusers | default | 2.2 / 2.2 | 2.1 | |
| sequential | diffusers | default | 4.6 / 4.6 | 0.3 | |
| none | diffusers | fuse | 5.7 / 5.7 | 2.0 | |
| balanced | diffusers | fuse | N/A | did not complete | |
| sequential | diffusers | fuse | N/A | did not complete | |
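To put the table in relative terms, the sketch below computes throughput for each offload mode as a share of the no-offload baseline, using only the rows without LoRA. The values are copied from the table; the helper name is hypothetical.

```python
# End-to-end it/s without LoRA, copied from the table above.
ITS = {"none": 6.7, "balanced": 4.5, "sequential": 0.6}


def relative_speed(mode: str, baseline: str = "none") -> float:
    """Throughput of a mode as a fraction of the baseline mode."""
    return ITS[mode] / ITS[baseline]


for mode in ITS:
    print(f"{mode}: {relative_speed(mode):.0%} of no-offload throughput")
```

On this test setup, balanced offload keeps about 67% of the baseline throughput and sequential offload about 9%.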