
Performance Tuning

Introduction

This guide covers practical ways to improve SD.Next throughput (it/s) across different hardware.

Performance tuning is highly system-dependent. Results vary by GPU generation, VRAM size, OS, selected backend, and inference platform (CUDA, ROCm, ONNX Runtime/Olive, ZLUDA).

Some combinations are unstable or incompatible on certain setups. Treat recommendations in this page as starting points, then test on your own system.

Feedback is very helpful. If you report issues with logs and screenshots, the team can usually improve defaults and compatibility over time.

Compute Settings

Note: Changing any of the settings on this page requires, at minimum, unloading and then reloading the model (after hitting Apply!), because these settings are applied at model load, not in real time.

Generally speaking, for the GPUs most of our user base runs (mostly Nvidia on Windows judging by Discord roles, so using CUDA), you will want the settings below. BF16 is also an option if you are on a 30xx-series or newer GPU.

Good settings:

[Image: Best-compute1]

Bad settings:

[Image: Bad-compute1]

In general, only use the "bad" settings when needed for compatibility: they are slower and usually use more memory. OpenVINO is an exception; values often appear as fp32 there due to implementation details.

If you see artifacts (for example square blocks), especially on older GPUs, try these settings. Prefer upcast sampling over --no-half when possible. Apply changes one at a time: unload model, reload model, then test generation.
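The precision rules above can be summarized in a small sketch. This is illustrative only, not SD.Next code; the function name and the 30xx cutoff for bfloat16 are assumptions based on the guidance in this section:

```python
# Illustrative sketch (not SD.Next code) of the precision guidance above.
# The "30xx and newer supports bf16" cutoff follows the text in this section.

def pick_dtype(gpu_series: int, artifacts_seen: bool = False) -> str:
    """Pick a precision setting for an NVIDIA 'NNxx'-series GPU."""
    if artifacts_seen:
        # Prefer upcast sampling before falling back to full fp32 (--no-half)
        return "fp16 + upcast sampling"
    if gpu_series >= 30:
        return "bf16"  # 30xx (Ampere) and newer support bfloat16
    return "fp16"

print(pick_dtype(40))        # bf16
print(pick_dtype(10))        # fp16
print(pick_dtype(16, True))  # fp16 + upcast sampling
```

Remember that after changing precision you still need to unload and reload the model for the setting to take effect.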


Model Compile

To use model compile options, enable at least one target: Model, VAE, Text Encoder, or Upscaler. Compiling Text Encoder usually has limited benefit. Compiling Model and VAE typically gives the largest speed gains.

Stable-fast

Stable-fast is one model compile option and can provide a significant speed-up on supported setups (typically NVIDIA, sometimes ZLUDA).

Install it from the SD.Next root after activating your venv: python cli\install-sf.py

You can install Stable-fast while SD.Next is already running in another terminal. SD.Next attempts to load it when selected, so a restart is not always required.

OneDiff

Linux (and possibly macOS) users can also try OneDiff, which may outperform Stable-fast. Install with pip install -U onediff from an active SD.Next venv.

NOTE: Do not compile Text Encoder with OneDiff. It is typically slower.

If you want to thank anyone for OneDiff support, hit up @aifartist on our Discord.

Inference options

These options are usually the fastest way to improve performance.

Token Merging (ToMe)

Sadly, ToMe does not work at the same time as Hypertile: it will be disabled if Hypertile is enabled, because Hypertile is faster. If Hypertile already works for you, don't bother touching this.

Token Merging (ToMe) can still improve speed and memory use. Typical values are 0.3 to 0.4; higher values can be faster but may reduce quality. Use XYZ testing to find a workable value for your workflow.


Honestly, the real performance gains start at 0.5 and up, but test and decide for yourself.

Bear in mind that it does have a quality impact on your image, greater the higher the setting, and it will make perfect reproduction from the same seed and prompt impossible, as far as we know.
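Conceptually, ToMe merges away roughly the configured fraction of tokens before attention runs, which is where the speed and memory gains come from. A toy sketch (the base token count and the exact merging math are simplifications, not the real implementation):

```python
# Toy illustration of how the ToMe ratio shrinks the token count
# attention has to process. The math here is a simplification.

def tokens_after_tome(tokens: int, ratio: float) -> int:
    """Approximate tokens remaining after merging away a `ratio` fraction."""
    return int(tokens * (1.0 - ratio))

base = 4096  # e.g. a 64x64 latent (512x512 image at 8x downscale)
for ratio in (0.3, 0.4, 0.5):
    print(ratio, tokens_after_tome(base, ratio))
```

Since attention cost grows faster than linearly with token count, even a 0.3 ratio can be a noticeable win, at some cost to quality.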

A newer implementation in the same space, ToDo, is planned for a future release.

Hypertile

Hypertile overrides, and is incompatible with, ToMe. It can also cause issues on some platforms, so if you get errors after turning it on, that might be why. At the moment, Hypertile is a much preferable option to Token Merging.

In most cases, enabling Hypertile UNet is enough. Default values are usually fine. When tile size is 0, it auto-adjusts to half of the shortest image side.

Hypertile VAE can cause issues on some setups and is usually not worth enabling.

You can tune swap size and UNet depth, but observed gains are usually small.
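The auto tile-size rule described above can be sketched as follows (illustrative only; the function name is made up, not SD.Next's actual code):

```python
# Sketch of the auto tile-size rule described above: when tile size is 0,
# Hypertile uses half of the shortest image side. Not SD.Next's actual code.

def auto_tile_size(width: int, height: int, tile_size: int = 0) -> int:
    if tile_size > 0:
        return tile_size  # an explicit value wins
    return min(width, height) // 2

print(auto_tile_size(1024, 768))      # 384 (half of 768)
print(auto_tile_size(768, 768, 256))  # 256 (explicit value)
```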


Suggested settings:

[Image: Hypertile-Suggested]

Hypertile also has a quality impact, but I've never measured it personally.

Other settings

Parallel process images in batch is intended for img2img batch mode.

Without it, batch size = n usually generates n images per input. With it, SD.Next generates one image per input and processes n inputs in parallel.
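The image counts implied by these two modes can be sketched with simple arithmetic (illustrative only; the function name is made up):

```python
# Illustration of the img2img batch semantics described above
# (pure arithmetic, not SD.Next code).

def images_generated(n_inputs: int, batch_size: int, parallel: bool) -> int:
    if parallel:
        # one output per input; inputs are processed batch_size at a time
        return n_inputs
    # each input generates batch_size outputs
    return n_inputs * batch_size

print(images_generated(8, 4, parallel=False))  # 32
print(images_generated(8, 4, parallel=True))   # 8
```

So enable this option when you want one result per source image, processed faster; leave it off when you want multiple variations of each source image.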
