Quantization
Quantization is a process of:
- Storage optimization
  reducing the memory footprint of the model by lowering the precision of its parameters
- Compute optimization
  speeding up inference by providing optimized kernels for native execution in the quantized precision
For storage-only quantization, the model weights are stored in lower precision but operations are still performed in the original precision,
which means each operation needs to be upcast to the original precision before execution, resulting in a performance overhead
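As a minimal illustration of the storage-only case, the sketch below uses plain PyTorch (torch>=2.1 for float8 dtypes, no SD.Next code) to store a weight in `float8_e4m3fn` and upcast it before the matmul; the tensor shapes are arbitrary.

```python
import torch

# Full-precision weight as it would exist in the original checkpoint
weight_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Storage-only quantization: keep the weight in 8-bit float to halve its memory footprint
weight_fp8 = weight_bf16.to(torch.float8_e4m3fn)

x = torch.randn(1, 4096, dtype=torch.bfloat16)

# ...but the matmul still runs in bfloat16, so the weight is upcast on every
# forward pass -- this repeated upcast is the performance overhead
y = x @ weight_fp8.to(torch.bfloat16).t()

print(weight_bf16.element_size(), weight_fp8.element_size())  # 2 bytes vs 1 byte per element
```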
> [!IMPORTANT]
> Before deciding which quantization method to use, you need to consider the following:
> - Compatibility with your platform: some quantization methods are not available on all platforms, see below for details
> - Performance benefits: some quantization methods may provide significant performance benefits on certain platforms
> - Quality trade-offs: some quantization methods may result in a loss of quality
Using Quantized Models
Quantization can be done in multiple ways:
- On-the-fly, by quantizing during model load
  Available by selecting Settings -> Quantization Settings for some quantization types
  Sometimes referred to as `pre` mode
  This is recommended for most users!
- By quantizing immediately after model load
  Available by selecting Settings -> Quantization Settings for all quantization types
  Sometimes referred to as `post` mode
- By simply loading a pre-quantized model
  The quantization type is auto-detected at the start of the load
- During model training itself
  Out of scope for this document
On-the-Fly Quantization
On-the-fly quantization is available for the BitsAndBytes, Optimum.Quanto, TorchAO, NNCF and Layerwise quantization methods and can be configured in Settings -> Quantization Settings
You can specify quantization for each model component:
- Model
  A shortcut that applies the selected quantization type to the entire model
- Transformer
  Applies quantization to the main DiT transformer, for example in SD3.5, HiDream, etc.
  Does not apply to the UNet used in SD15/SDXL, as UNet quantization is not recommended
- Text-encoder
  Applies to T5-like text encoders, for example in SD3.5, FLUX.1, HiDream, etc.
- VAE
  Quantization of the VAE module is typically not recommended
- LLM
  Used by models that rely on an LLM for secondary text encoding, as well as by VLM models during the captioning, interrogate and prompt-enhance features
- Video
  Specifically for video models; typically refers to the DiT module of the video model
You can mix and match quantization types for each model component
For example, you can use BitsAndBytes for the transformer and Optimum.Quanto for the LLM
Quantization Engines
SD.Next supports multiple quantization engines, each with multiple quantization schemes:
- BitsAndBytes: 3 float-based quantization schemes
- Optimum.Quanto: 3 int-based and 2 float-based quantization schemes
- TorchAO: 4 int-based and 3 float-based quantization schemes
- NNCF: 4 int-based quantization schemes
- Layerwise: 2 float8-based quantization schemes
- GGUF: with pre-quantized weights
> [!IMPORTANT]
> Not all quantization engines are available on all platforms, see notes below for details!
> Using any quantization engine for the first time may result in failure as required libraries are downloaded and installed
> Restart SD.Next and try again if you encounter any issues
> [!TIP]
> If you're on Windows with a compatible GPU, you may try WSL2 for broader feature compatibility
> See the WSL Wiki for more details
BitsAndBytes
Typical models pre-quantized with `bitsandbytes` have names like `*nf4.safetensors` or `*fp8.safetensors`
> [!NOTE]
> BnB allows the use of balanced offload as well as fast on-the-fly quantization during load, making it the most versatile choice, but it is not available on all platforms
Limitations:
- the default `bitsandbytes` package only supports nVidia GPUs
  some quantization types require a newer GPU with supported CUDA ops: e.g. nVidia Turing GPUs or newer
- `bitsandbytes` relies on `triton` packages which are not available on Windows unless manually compiled/installed
  without them, performance is significantly reduced
  - for nVidia: automatically installed as needed
  - for AMD/ROCm: link
  - for Intel/IPEX: link
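For reference, on-the-fly NF4 quantization of a transformer at the library level looks roughly like the sketch below. This is a hedged example using the diffusers API directly rather than the SD.Next UI; it assumes a recent diffusers with bitsandbytes support and a CUDA GPU, and the FLUX.1-dev repo id is used only as an example.

```python
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

# NF4 on-the-fly quantization: weights are quantized while the checkpoint is loaded
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # example repo; any DiT-style transformer works
    subfolder="transformer",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
```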
Optimum-Quanto
Typical models pre-quantized with `optimum.quanto` have names like `*qint.safetensors`
> [!NOTE]
> OQ is highly efficient with its qint8/qint4 quantization types, but it cannot be used with broad offloading methods
Limitations:
- requires torch>=2.4.0
  if you're running an older torch, you can try upgrading it or running SD.Next with the `--reinstall` flag
- not compatible with balanced offload
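Outside of the UI, the underlying optimum.quanto flow is the usual quantize/freeze pair. The following is a self-contained sketch on a toy module (the layer sizes are arbitrary), assuming the optimum-quanto package is installed.

```python
import torch
from optimum.quanto import quantize, freeze, qint8

# Toy stand-in for a text encoder / transformer block
model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
)

quantize(model, weights=qint8)   # replace Linear weights with qint8 quantized tensors
freeze(model)                    # drop the original full-precision weights to reclaim memory

with torch.inference_mode():
    out = model(torch.randn(1, 768))
```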
TorchAO
TorchAO is available for on-the-fly quantization during model load as well as for post-load quantization
Limitations:
- Requires torch>=2.5.0
- int4-based quantization cannot be used with any offload method
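A minimal post-load sketch with the torchao API on a toy module is shown below; it assumes torchao is installed alongside torch>=2.5, and int8 weight-only is used here since, as noted above, int4 variants carry additional offload restrictions.

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# In-place weight-only int8 quantization of all Linear layers
quantize_(model, int8_weight_only())

with torch.inference_mode():
    out = model(torch.randn(1, 1024))
```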
NNCF
NNCF provides fully cross-platform storage-only quantization (referred to as model compression) on any platform with PyTorch, plus additional compute optimizations on the OpenVINO platform
> [!NOTE]
> The advantage of NNCF is that it works on any platform: if you're having issues with `optimum-quanto` or `bitsandbytes`, try it out!
- broad platform and GPU support
- enable in Settings -> Quantization Settings -> NNCF
- see NNCF Wiki for more details
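For reference, a minimal weight-compression sketch with NNCF on a toy module follows; it assumes the nncf package supports PyTorch weight compression in your installed version, and it does not show the extra INT8-SYM decompression/MatMul options that appear in the benchmark below.

```python
import torch
import nncf

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Data-free, storage-only weight compression (INT8 by default);
# weights are decompressed on-the-fly during the forward pass
compressed = nncf.compress_weights(model)

with torch.inference_mode():
    out = compressed(torch.randn(1, 1024))
```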
GGUF
GGUF is a binary file format used to package pre-quantized models.
GGUF was originally designed by the llama.cpp project and is intended to be used with its GGML execution runtime.
However, without GGML, GGUF provides storage-only quantization which means that every operation needs to be upcast to current device precision before execution (typically FP16 or BF16) which comes with a significant performance overhead.
> [!WARNING]
> Right now, all popular T2I inference UIs (SD.Next, Forge, ComfyUI, InvokeAI, etc.) use GGUF as storage-only, and as such the usage of GGUF is not recommended!
- `gguf` supports a wide range of quantization types and is not platform or GPU dependent
- `gguf` does not provide native GPU kernels, which means that `gguf` is purely a storage optimization
- `gguf` reduces model size and memory usage, but it does slow down model inference since all quantized weights are de-quantized on-the-fly
Limitations:
- `gguf` is not compatible with model offloading as it would trigger de-quantization
- the only component supported in the `gguf` binary format is the UNET/Transformer
  you cannot load an all-in-one single-file GGUF model
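For reference, loading a GGUF-quantized transformer at the library level looks roughly like the diffusers sketch below; it assumes a diffusers version with GGUF support, the file path is a hypothetical local pre-quantized FLUX transformer, and, matching the limitation above, only the transformer component is loaded this way.

```python
import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# GGUF is storage-only: compute_dtype is the precision weights are upcast to at runtime
transformer = FluxTransformer2DModel.from_single_file(
    "path/to/flux1-dev-Q4_K_S.gguf",  # hypothetical local GGUF file for illustration
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
```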
Benchmarks
Comparing performance of different quantization methods on the FLUX.1-Dev model
This is not a comprehensive benchmark, but rather a quick overview of the performance of different quantization methods
Environment:
- model==FLUX.1-Dev
- gpu==nVidia RTX4090
- torch==2.7.0
- dtype==auto
- attention==sdpa
- offload==Balanced
- sampler==FlowMatchEulerDiscreteScheduler
- resolution==1024x1024
- batch==1
| Engine | DType | It/s | Note |
|---|---|---|---|
| Torch | Float16 | 0.45 | |
| Torch | BFloat16 | 0.48 | Default |
| BnB | NF4 | 1.32 | |
| BnB | FP8 | 1.05 | |
| NNCF | INT8 | 1.07 | |
| NNCF | INT8-SYM | 0.85 | |
| NNCF | INT8-SYM | 1.17 | Decompress using torch.compile |
| NNCF | INT8-SYM | 1.23 | Use direct INT8 MatMul |
| NNCF | INT8-SYM | 1.46 | Decompress & MatMul |
| Quanto | any | N/A | Balanced offload not supported |
| TorchAO | INT8 | 0.51 | Weights-only |
| Layerwise | FP8 E4M | 0.99 | |
| SVDQuant | INT4 | 4.66 | Nunchaku pre-compiled |
Recommendations
CUDA
For CUDA environments, BitsAndBytes is the most versatile and recommended option, and `nf4` is the most efficient quantization type from both a memory and compute perspective
Other
For ROCm, ZLUDA or IPEX environments:
- NNCF is the best fully cross-platform option and brings good memory savings, but it does not have native compute optimizations
- Optimum.Quanto is the faster option if you can fit the model into VRAM, either using Model offload or no offload at all, since it is not compatible with the default Balanced offload
Errors
> [!CAUTION]
> Using incompatible configurations will result in errors during model load:
> - BitsAndBytes nf4 quantization is not compatible with sequential offload
>   Error: `Blockwise quantization only supports 16/32-bit floats`
> - Quanto qint quantization is not compatible with balanced offload
>   Error: `QBytesTensor.new() missing 5 required positional arguments`
> - Quanto qint quantization is not compatible with sequential offload
>   Error: `Expected all tensors to be on the same device`
Triton
Many quantization schemes rely on the Triton compiler for Torch, which is not available on all platforms
If your installation fails, you can try building `triton` from source or finding pre-built binary wheels
Triton for Windows
A Triton fork is available for Windows and can be installed by running the following PowerShell script from your SD.Next installation folder:
install-triton.ps1
```powershell
$ErrorActionPreference = "Stop"

# Locate the SD.Next virtual environment (override with the VENV_DIR environment variable)
$VENV_DIR = if ($env:VENV_DIR) { $env:VENV_DIR } else { Resolve-Path "venv" }
$PYTHON = "$VENV_DIR\Scripts\python"
$PIP = "$VENV_DIR\Scripts\pip"

# Determine the Python major/minor version to select the matching wheel
$sys_ver = & $PYTHON -VV
$sys_ver_major, $sys_ver_minor = $sys_ver.Split(" ")[1].Split(".")[0, 1]

# Download and install the matching triton-windows wheel, then clean up
$filename = "triton-3.2.0-cp$sys_ver_major$sys_ver_minor-cp$sys_ver_major$sys_ver_minor-win_amd64.whl"
$url = "https://github.com/woct0rdho/triton-windows/releases/download/v3.2.0-windows.post10/$filename"
Invoke-WebRequest $url -OutFile $filename
& $PIP install $filename
Remove-Item $filename
```