Offload

Offload moves a model, or parts of a model, between GPU memory (VRAM) and system memory (RAM) to reduce the model's VRAM footprint and allow it to run on GPUs with less VRAM.

Offload Mode

Tip

Set the offload mode in Settings -> Models & Loading -> Model offload mode

Balanced

Balanced offload works differently from all other offloading methods: it offloads only when VRAM usage exceeds the user-specified threshold.

Recommended for compatible high-VRAM GPUs
Faster, but requires a compatible platform and sufficient VRAM
Balanced offload moves parts of the model according to user-specified thresholds, which control how much VRAM is used
The high threshold sets the maximum memory usage allowed for the model weights of a single model component
The low threshold decides when unused models are offloaded back to RAM
If VRAM usage is higher than the low threshold, unused models are offloaded; otherwise nothing is done
Configure the thresholds in Settings -> Models & Loading -> Balanced offload GPU high / low watermark
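
The two thresholds can be pictured with a minimal sketch. This is not the actual implementation: the class and function names are hypothetical, both watermarks are assumed to be fractions of total VRAM, and the default values 0.2 and 0.6 are borrowed from the gpu-min / gpu-max defaults on the assumption that they correspond to the low and high watermarks.

```python
from dataclasses import dataclass


@dataclass
class Watermarks:
    """Hypothetical container; both values are assumed to be fractions of total VRAM."""
    low: float = 0.2   # assumed to correspond to gpu-min
    high: float = 0.6  # assumed to correspond to gpu-max


def should_offload_unused(used_gb: float, total_gb: float, marks: Watermarks) -> bool:
    """Offload unused models only when current usage is above the low watermark."""
    return used_gb > marks.low * total_gb


def component_fits(component_gb: float, total_gb: float, marks: Watermarks) -> bool:
    """A single component's weights may use at most the high-watermark share of VRAM."""
    return component_gb <= marks.high * total_gb


marks = Watermarks()
print(should_offload_unused(used_gb=6.0, total_gb=16.0, marks=marks))
print(component_fits(component_gb=12.0, total_gb=16.0, marks=marks))
```

Under these assumptions, lowering the low watermark makes offloading of unused models more eager, while raising the high watermark lets larger components stay fully resident on the GPU.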

Balanced offloading default behavior is based on detected GPU memory:
- default: offload=balanced gpu-min=0.2 gpu-max=0.6 gc-threshold=0.7
- <= 4gb/lowvram: offload=sequential quantization=cpu vae-tiling=on gc-threshold=0.0
- <= 12gb/medvram: offload=balanced gpu-min=0.0 vae-tiling=on
- >= 24gb/highvram: offload=balanced gpu-max=0.8 never=clip-l,clip-g,vae
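
The tiers above can be read as a simple lookup on detected VRAM. The sketch below is illustrative only: the function name and dictionary layout are hypothetical, and because the source does not say how tier-specific values combine with the defaults, each tier returns only the values listed for it.

```python
def default_offload_profile(vram_gb: float) -> dict:
    """Return the listed default settings for a given amount of detected VRAM (in GB)."""
    if vram_gb <= 4:   # lowvram tier
        return {"offload": "sequential", "quantization": "cpu",
                "vae-tiling": "on", "gc-threshold": 0.0}
    if vram_gb <= 12:  # medvram tier
        return {"offload": "balanced", "gpu-min": 0.0, "vae-tiling": "on"}
    if vram_gb >= 24:  # highvram tier
        return {"offload": "balanced", "gpu-max": 0.8,
                "never": ["clip-l", "clip-g", "vae"]}
    # Everything between 12 GB and 24 GB uses the plain defaults
    return {"offload": "balanced", "gpu-min": 0.2, "gpu-max": 0.6, "gc-threshold": 0.7}


for gb in (4, 8, 16, 24):
    print(gb, default_offload_profile(gb))
```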

Warning

Not compatible with Optimum.Quanto qint quantization

Sequential

Works layer by layer within each model component that is marked as offload-compatible

Recommended for low VRAM GPUs
Much slower, but allows running large models such as FLUX even on GPUs with 2-4GB VRAM

Warning

Not compatible with Quanto qint or BitsAndBytes nf4 quantization

Note

Using --lowvram automatically enables sequential offload

Model

Works at the model component level by offloading whole components that are marked as offload-compatible
For example, VAE, text-encoder, etc.

Recommended for medium VRAM GPUs when balanced offload is not compatible
Higher compatibility than either balanced or sequential, but smaller memory savings
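
The difference in savings between model and sequential offload can be illustrated with a toy calculation. The component names and sizes below are made up, and the sketch counts only resident model weights (not activations or other buffers), so it shows the granularity of each mode rather than real memory usage.

```python
# Hypothetical weight sizes in MB; each component is a list of per-layer sizes.
components = {
    "text_encoder": [500, 500],
    "unet": [1000, 2000, 1500],
    "vae": [200, 100],
}


def peak_no_offload(parts: dict) -> int:
    """All components stay on the GPU at once."""
    return sum(sum(layers) for layers in parts.values())


def peak_model_offload(parts: dict) -> int:
    """Only one whole component is on the GPU at a time."""
    return max(sum(layers) for layers in parts.values())


def peak_sequential_offload(parts: dict) -> int:
    """Only one layer of one component is on the GPU at a time."""
    return max(max(layers) for layers in parts.values())


print("none:      ", peak_no_offload(components), "MB")
print("model:     ", peak_model_offload(components), "MB")
print("sequential:", peak_sequential_offload(components), "MB")
```

In this toy setup the peak is bounded by the largest component under model offload and by the largest single layer under sequential offload, which is why sequential saves the most memory while moving data most often.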

Limitations: N/A

Performance Notes

Tested using SDXL with 2 large LoRA models
Sequential offload is default for GPUs with 4GB or less
Balanced offload is default for GPUs with more than 4GB
Balanced offload is slower than no offload, but allows using large models such as SD35 and FLUX.1 out-of-the-box
Balanced offload was tested with its default values
LoRA overhead is measured in seconds and reported as first / subsequent iterations
LoRA mode=backup can use up to 2x system memory
Using backup can be prohibitive on large models such as SD35 or FLUX.1

Offload mode	LoRA type	LoRA mode	LoRA overhead	End-to-end it/s	Note
none	none	N/A	N/A	6.7	fastest inference
balanced	none	N/A	N/A	4.5	default without LoRA
sequential	none	N/A	N/A	0.6	lowvram
none	native	backup	1.8 / 0.0	6.0
balanced	native	backup	1.3 / 0.0	2.8
sequential	native	backup	5.8 / 0.0	0.5
none	native	fuse	1.3 / 1.3	4.8
balanced	native	fuse	2.8 / 2.5	3.1	default with LoRA
sequential	native	fuse	8.8 / 7.7	0.4
none	diffusers	default	2.9 / 2.9	3.8
balanced	diffusers	default	2.2 / 2.2	2.1
sequential	diffusers	default	4.6 / 4.6	0.3
none	diffusers	fuse	5.7 / 5.7	2.0
balanced	diffusers	fuse	N/A		did not complete
sequential	diffusers	fuse	N/A		did not complete
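
To put the table in relative terms, the short script below divides each offload mode's end-to-end it/s by the no-offload figure for the same LoRA setup. The numbers are copied from the rows above; the helper name and dictionary layout are just for illustration.

```python
# End-to-end it/s copied from the table: rows without LoRA and with native LoRA in fuse mode.
results = {
    ("none", "no LoRA"): 6.7,
    ("balanced", "no LoRA"): 4.5,
    ("sequential", "no LoRA"): 0.6,
    ("none", "native fuse"): 4.8,
    ("balanced", "native fuse"): 3.1,
    ("sequential", "native fuse"): 0.4,
}


def relative_speed(data: dict, lora: str) -> dict:
    """Throughput of each offload mode as a fraction of the no-offload run."""
    baseline = data[("none", lora)]
    return {mode: its / baseline for (mode, setup), its in data.items() if setup == lora}


for lora in ("no LoRA", "native fuse"):
    for mode, ratio in relative_speed(results, lora).items():
        print(f"{lora:12s} {mode:10s} {ratio:.0%} of no-offload speed")
```

In both setups balanced offload retains roughly two thirds of the no-offload throughput, while sequential offload runs at under a tenth of it.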