StableDiffusion-XL
Downloading SD-XL
Download these two files from Hugging Face and place them in your checkpoint directory (a subfolder is recommended).
Setup for SD-XL
To make SD-XL and refiner/backend/pipeline switching easier, select the following items in Settings -> User Interface:
After selecting them, click Apply settings, then Restart server. When the server is active again and the page reloads, the Quicksettings bar should look like this (assuming SDXL is selected):
VRAM Optimization
There are 3 memory optimization methods for the Diffusers backend (and SDXL): Model Shuffle, Medvram, and Lowvram.
Choose one based on your GPU, VRAM, and target batch size.
Note: VAE Tiling can save additional VRAM, but VAE Slicing is generally recommended when VRAM is limited.
Enable attention slicing should generally not be used, as the performance impact is significant.
Option 1: Model Shuffle
"Model Shuffle" dynamically moves model parts between GPU and CPU to use VRAM more efficiently.
Enable it by turning on these 3 options in Diffusers settings:
- Move the base model to CPU when using the refiner.
- Move the refiner model to CPU when not in use.
- Move the UNet to CPU during VAE decoding.
To use Model Shuffle, do not enable --medvram or --lowvram, then apply the settings above; the key options are the three Move checkboxes.
If you enable either Model CPU offload or Sequential CPU offload, Model Shuffle is deactivated and ignored.
VRAM Usage: "Model Shuffle" will work in 8 GB of VRAM.
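Conceptually, Model Shuffle boils down to keeping only the submodel needed for the current stage on the GPU and parking the rest in system RAM. A minimal illustrative sketch — not SD.Next's actual implementation — assuming PyTorch-style modules that support `.to()`:

```python
# Illustrative sketch of the "Model Shuffle" idea: only the submodel
# needed for the current stage lives on the GPU; everything else is
# parked on the CPU. Not SD.Next's actual code.

def shuffle(stage, base_unet, refiner_unet, vae, device="cuda"):
    """Place only the module needed for `stage` on `device`.

    Covers the three Move options: base -> CPU while refining,
    refiner -> CPU when not in use, UNet -> CPU during VAE decode.
    """
    modules = {"base": base_unet, "refiner": refiner_unet, "vae": vae}
    for name, module in modules.items():
        module.to(device if name == stage else "cpu")
```

In practice SD.Next performs these moves for you at the stage boundaries; the sketch only shows why no two large submodels need to occupy VRAM at the same time.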
Option 2: MEDVRAM
If you have a 6GB VRAM GPU, or need larger SD-XL batches, use --medvram.
This significantly reduces VRAM requirements at the cost of inference speed.
Cannot be used with --lowvram/Sequential CPU offload.
Note: Until some upstream fixes go in, this will not work with DML or MAC.
Alternatively, you can enable the Enable model CPU offload checkbox in the Settings tab on the Diffusers settings page:
- Model CPU offload (same as --medvram)
- VAE slicing (recommended)
- Attention slicing (NOT recommended)
VRAM Usage: "Model CPU Offload" can work in 6 GB of VRAM.
Note: --medvram supersedes Model Shuffle options (Move base model, refiner model, UNet), and cannot be used with --lowvram/Sequential CPU offload.
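For reference, if you drive diffusers directly instead of through SD.Next, the same options are exposed as pipeline methods. A hedged sketch — the `enable_*` method names are standard diffusers pipeline API, but `pipe` is assumed to be an already-loaded SDXL pipeline:

```python
# Sketch: diffusers-level equivalents of the Model CPU offload settings.
# `pipe` is assumed to be an already-loaded SDXL pipeline object.

def configure_medvram(pipe):
    pipe.enable_model_cpu_offload()  # same idea as --medvram
    pipe.enable_vae_slicing()        # recommended: decode one image at a time
    # pipe.enable_attention_slicing()  # NOT recommended: large speed penalty
```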
Option 3: LOWVRAM
If your GPU has as little as 2GB VRAM, start SD.Next with --lowvram to vastly reduce VRAM requirements at an even larger speed cost.
This is effectively Enable Sequential CPU offload.
Note: VAE slicing, VAE tiling, and Attention slicing are all enabled by --lowvram regardless of the checkboxes.
Using this setting on higher-VRAM GPUs makes generation slower, but allows very large SD-XL batches, up to 24 on a 12GB GPU.
Note: Until some upstream fixes go in, this will not work with SDXL LoRAs and SD 1.5.
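At the diffusers level, --lowvram corresponds roughly to the following calls. A sketch under the same assumption as above (the `enable_*` methods are standard diffusers pipeline API; `pipe` is an already-loaded SDXL pipeline):

```python
# Sketch: diffusers-level equivalent of --lowvram. Sequential CPU
# offload streams submodules to the GPU one at a time; --lowvram also
# forces the slicing/tiling options on regardless of the checkboxes.

def configure_lowvram(pipe):
    pipe.enable_sequential_cpu_offload()  # effectively --lowvram
    pipe.enable_vae_slicing()
    pipe.enable_vae_tiling()
    pipe.enable_attention_slicing()
```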
For further gains, continue below and configure SD.Next with the Fixed FP16 VAE.
Fixed FP16 VAE
It is currently recommended to use a Fixed FP16 VAE instead of built-in SD-XL base/refiner VAEs.
This can significantly reduce VRAM usage (from ~6GB to <1GB for VAE work) and roughly double VAE processing speed.
Below are the instructions for installation and use:
- Download Fixed FP16 VAE to your VAE folder.
- In your Settings tab, go to Diffusers settings, set VAE Upcasting to False, and hit Apply.
- Select your VAE, then use Reload Checkpoint to reload the model, or hit Restart server.
You should be good to go.
Using SD-XL
- Select Autodetect or Stable Diffusion XL from the Pipeline dropdown.
- Select sd_xl_base_1.0.safetensors from the Checkpoint dropdown.
- Optional: select sd_xl_refiner_1.0.safetensors from the Refiner dropdown.
Using SD-XL Refiner
To use the refiner, load it first, then enable it with Second pass in the UI.
Using refiner is optional; the base model can already produce very good results.
Refiner can be used in two modes: traditional workflow, or early handover from base to refiner.
In either case, the refiner runs a number of steps calculated from Refiner steps.
If denoise start is set to 0 or 1, the traditional workflow is used:
- Base model runs from 0% -> 100% using Sampling steps.
- Refiner model runs from 0% -> 100% using Refiner steps.
In this mode, refiner may not improve quality much and often only smooths the image, because the base model already reached 100% and little noise remains.
If denoise start is set to any other value, handover mode is used:
- Base model runs from 0% -> denoise_start%; the exact number of steps is calculated internally based on Sampling steps.
- Refiner model runs from denoise_start% -> 100%; the exact number of steps is calculated internally based on Refiner steps.
In this mode, different primary/refiner step ratios are allowed, but may produce unexpected results because base and refiner operations are not perfectly aligned.
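The step bookkeeping in handover mode can be sketched as follows. This is an approximation, not the exact internal formula — it assumes each model simply keeps the portion of its own schedule that falls on its side of denoise_start:

```python
def split_steps(sampling_steps, refiner_steps, denoise_start):
    """Approximate how many denoising operations each model executes
    in handover mode (sketch, not SD.Next's exact internal formula)."""
    # Base covers 0% -> denoise_start% of its schedule.
    base_executed = round(sampling_steps * denoise_start)
    # Refiner covers denoise_start% -> 100% of its own schedule.
    refiner_executed = refiner_steps - round(refiner_steps * denoise_start)
    return base_executed, refiner_executed
```

For example, sampling_steps=40, refiner_steps=20, denoise_start=0.8 gives roughly 32 base operations and 4 refiner operations; the rounding on each side is one reason base and refiner are not perfectly aligned.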
Note on steps vs timesteps: steps do not directly equal internal operations.
Steps are used to calculate execution points. For example, steps=6 roughly means denoising at 0% -> 20% -> 40% -> 60% -> 80% -> 100%.
For that reason, values above 99 are not meaningful.
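The percentage arithmetic in the steps=6 example above can be reproduced in a few lines (a sketch of the arithmetic only, not the scheduler's actual timestep calculation):

```python
def execution_points(steps):
    """Evenly spaced denoising points from 0% to 100%, as in the
    steps=6 example above (sketch of the arithmetic only)."""
    return [round(100 * i / (steps - 1)) for i in range(steps)]

# steps=6 -> [0, 20, 40, 60, 80, 100]
```

With 100 steps the points are already ~1% apart, which is why values above 99 buy nothing.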