Offload

Offload moves a model, or model parts, between GPU memory (VRAM) and system memory (RAM). This reduces VRAM usage and helps run larger models on lower-VRAM GPUs.

Offload Mode

Tip

Set the offload mode under Settings -> Models & Loading -> Model offload mode.

Balanced

Balanced offload works differently from all other offloading methods: it offloads only when VRAM usage exceeds a user-specified threshold.

- The default mode for any detected GPU, suitable across the memory range.
- Faster than other offload modes, but requires platform compatibility and enough VRAM.
- Model parts move based on configured thresholds, keeping VRAM usage within the limit.
- The high threshold sets the maximum memory usage allowed for weights of a single model component.
- The low threshold controls when unused model parts are offloaded back to RAM. If VRAM usage is above the low threshold, offloading runs. Otherwise, it does nothing.
- Thresholds are configured under Settings -> Models & Loading -> Offload low watermark / Offload GPU high watermark.
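
The two thresholds can be read as simple fractions of total VRAM. The following is a minimal sketch of that reading, not the actual implementation: the function names `needs_offload` and `component_fits` are hypothetical, and treating the watermarks as fractions of total VRAM is an assumption based on the default values of 0.2 and 0.6.

```python
GIB = 1024 ** 3


def needs_offload(used_bytes: int, total_bytes: int, low_watermark: float) -> bool:
    """Hypothetical helper: offloading runs only while VRAM usage is above the low watermark."""
    return used_bytes > low_watermark * total_bytes


def component_fits(weight_bytes: int, total_bytes: int, high_watermark: float) -> bool:
    """Hypothetical helper: one component's weights may use at most the high-watermark share of VRAM."""
    return weight_bytes <= high_watermark * total_bytes


if __name__ == "__main__":
    total = 16 * GIB  # assumed 16 GiB GPU, for illustration only
    print(needs_offload(2 * GIB, total, 0.2))    # below the low watermark: nothing happens
    print(needs_offload(8 * GIB, total, 0.2))    # above the low watermark: offloading runs
    print(component_fits(12 * GIB, total, 0.6))  # component larger than the high watermark
```

Under this reading, a low watermark of 0 means that any unused model part is offloaded as soon as it occupies VRAM, which trades speed for the lowest resident memory.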

Defaults are selected from detected GPU memory. Every detected GPU uses balanced offload; only the explicit --lowvram flag switches to sequential, and --medvram forces balanced with a low watermark of 0.

- 4 GB or less: balanced, low watermark 0.0, plus the --lowvram low-memory optimizations (VAE tiling, aggressive garbage collection)
- 4-12 GB: balanced, low watermark 0.0, large text encoders always offloaded
- 12-24 GB (default): balanced, low/high watermark 0.2 / 0.6
- 24 GB or more: balanced, high watermark 0.8, large text encoders always offloaded, CLIP and VAE never offloaded
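
The default selection described here can be sketched as a small lookup. This is an illustrative reading only: `OffloadDefaults` and `pick_defaults` are hypothetical names, the handling of the exact 4, 12, and 24 GB boundaries is an assumption (the listed ranges overlap at their edges), and giving --lowvram precedence over --medvram is also an assumption. Values the list does not state are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OffloadDefaults:
    mode: str
    low_watermark: Optional[float] = None   # None: not stated in the defaults list
    high_watermark: Optional[float] = None  # None: not stated in the defaults list
    notes: str = ""


def pick_defaults(gpu_gb: float, lowvram: bool = False, medvram: bool = False) -> OffloadDefaults:
    """Hypothetical sketch of default selection from detected GPU memory and flags."""
    if lowvram:  # explicit flag is the only path to sequential offload
        return OffloadDefaults("sequential", notes="explicit --lowvram")
    if medvram:  # explicit flag forces balanced with a low watermark of 0
        return OffloadDefaults("balanced", low_watermark=0.0, notes="explicit --medvram")
    if gpu_gb <= 4:
        return OffloadDefaults("balanced", low_watermark=0.0,
                               notes="low-memory optimizations (VAE tiling, aggressive GC)")
    if gpu_gb < 12:  # assumption: exactly 12 GB falls into the 12-24 GB bucket
        return OffloadDefaults("balanced", low_watermark=0.0,
                               notes="large text encoders always offloaded")
    if gpu_gb < 24:  # assumption: exactly 24 GB falls into the 24 GB or more bucket
        return OffloadDefaults("balanced", low_watermark=0.2, high_watermark=0.6)
    return OffloadDefaults("balanced", high_watermark=0.8,
                           notes="large text encoders always offloaded; CLIP and VAE never offloaded")


if __name__ == "__main__":
    for gb in (4, 8, 16, 24):
        print(gb, pick_defaults(gb))
```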

Warning

Not compatible with Optimum.Quanto qint quantization.

Sequential

Works layer by layer for each model component marked as offload-compatible.

- Recommended for very low VRAM, around 4 GB or less.
- Much slower, but can run large models such as FLUX on GPUs with 2-4 GB VRAM.

Warning

Not compatible with Quanto qint or BitsAndBytes nf4 quantization.

Note

Using --lowvram automatically enables sequential offload.

Model

Works at the model-component level by offloading components marked as offload-compatible. Examples include VAE and text encoder.

- Recommended when balanced offload is not compatible.
- More compatible than balanced or sequential, but with lower memory savings.

Limitations: N/A
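
To make the difference in granularity concrete, the toy calculation below estimates the peak size of weights resident in VRAM when nothing is offloaded, when whole components are swapped (model offload), and when individual layers are swapped (sequential offload). It is a simplified sketch under stated assumptions: the function name and all component and layer sizes are hypothetical, only weights are counted (no activations or caches), and every component is assumed to be offload-compatible. Balanced offload is not modeled because its behavior depends on the configured thresholds.

```python
def peak_weights_on_gpu(components: dict[str, list[int]], mode: str) -> int:
    """Hypothetical estimate of peak resident weight size (MiB) per offload granularity."""
    if mode == "none":
        # Everything stays in VRAM at once.
        return sum(sum(layers) for layers in components.values())
    if mode == "model":
        # One whole component is resident at a time.
        return max(sum(layers) for layers in components.values())
    if mode == "sequential":
        # One layer is resident at a time.
        return max(max(layers) for layers in components.values())
    raise ValueError(f"unknown mode: {mode}")


# Made-up layer sizes in MiB, for illustration only.
EXAMPLE = {
    "text_encoder": [500, 500],
    "unet": [1000, 2000, 2000],
    "vae": [200],
}

if __name__ == "__main__":
    for mode in ("none", "model", "sequential"):
        print(mode, peak_weights_on_gpu(EXAMPLE, mode))
```

The smaller the unit that is swapped, the lower the peak, but the more transfers are needed per step, which matches the speed ordering described for these modes.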

Performance Notes

- Tested using SDXL with 2 large LoRA models.
- Balanced offload is the default for all detected GPUs; sequential offload applies only with --lowvram. Balanced offload is slower than no offload, but it enables large models such as SD35 and FLUX.1 out of the box.
- Balanced offload was tested with its default values.
- LoRA overhead is measured in seconds and reported as first / subsequent iterations.
- LoRA mode=backup can use up to 2x system memory. On large models such as SD35 or FLUX.1, this can be prohibitive.

Offload mode	LoRA type	LoRA mode	LoRA overhead (s, first / subsequent)	End-to-end it/s	Note
none	none	N/A	N/A	6.7	fastest inference
balanced	none	N/A	N/A	4.5	default without LoRA
sequential	none	N/A	N/A	0.6	lowvram
none	native	backup	1.8 / 0.0	6.0
balanced	native	backup	1.3 / 0.0	2.8
sequential	native	backup	5.8 / 0.0	0.5
none	native	fuse	1.3 / 1.3	4.8
balanced	native	fuse	2.8 / 2.5	3.1	default with LoRA
sequential	native	fuse	8.8 / 7.7	0.4
none	diffusers	default	2.9 / 2.9	3.8
balanced	diffusers	default	2.2 / 2.2	2.1
sequential	diffusers	default	4.6 / 4.6	0.3
none	diffusers	fuse	5.7 / 5.7	2.0
balanced	diffusers	fuse	N/A		did not complete
sequential	diffusers	fuse	N/A		did not complete
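
As a reading aid for the table, the short script below expresses the no-LoRA throughput of each offload mode relative to running without offload. The it/s figures are copied from the first three rows of the table; the variable and function names are illustrative.

```python
# End-to-end it/s without LoRA, taken from the table.
NO_LORA_ITS = {"none": 6.7, "balanced": 4.5, "sequential": 0.6}


def relative_throughput(its: dict[str, float], baseline: str = "none") -> dict[str, float]:
    """Return each mode's throughput as a fraction of the baseline mode."""
    base = its[baseline]
    return {mode: value / base for mode, value in its.items()}


if __name__ == "__main__":
    for mode, ratio in relative_throughput(NO_LORA_ITS).items():
        print(f"{mode}: {ratio:.0%} of no-offload throughput")
```

In this test, balanced offload retains roughly two thirds of the no-offload throughput, while sequential offload retains under a tenth.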