Quantization
Quantization is a process of:
- Storage optimization
  reducing the memory footprint of the model by lowering the precision of its parameters
- Compute optimization
  speeding up inference by providing optimized kernels for native execution in the quantized precision
For storage-only quantization, the model weights are stored in lower precision but operations are still performed in the original precision,
which means each operation needs to be upcast to the original precision before execution, resulting in a performance overhead
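As a minimal illustration of the storage-only case, the sketch below uses plain PyTorch (torch>=2.1 for float8 dtypes, no SD.Next code) to store a weight in `float8_e4m3fn` and upcast it before the matmul; the tensor shapes are arbitrary.

```python
import torch

# Full-precision weight as it would exist in the original checkpoint
weight_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16)

# Storage-only quantization: keep the weight in 8-bit float to halve its memory footprint
weight_fp8 = weight_bf16.to(torch.float8_e4m3fn)

x = torch.randn(1, 4096, dtype=torch.bfloat16)

# ...but the matmul still runs in bfloat16, so the weight is upcast on every
# forward pass -- this repeated upcast is the performance overhead
y = x @ weight_fp8.to(torch.bfloat16).t()

print(weight_bf16.element_size(), weight_fp8.element_size())  # 2 bytes vs 1 byte per element
```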
> [!IMPORTANT]
> Before deciding which quantization method to use, you need to consider the following:
> - Compatibility with your platform: some quantization methods are not available on all platforms, see below for details
> - Performance benefits: some quantization methods may provide significant performance benefits on certain platforms
> - Quality trade-offs: some quantization methods may result in a loss of quality
Using Quantized Models
Quantization can be done in multiple ways:
- On-the-fly, by quantizing during model load
  Available by selecting Settings -> Quantization Settings for some quantization types
  Sometimes referred to as `pre` mode
  This is recommended for most users!
- By quantizing immediately after model load
  Available by selecting Settings -> Quantization Settings for all quantization types
  Sometimes referred to as `post` mode
- By simply loading a pre-quantized model
  The quantization type is auto-detected at the start of the load
- During model training itself
  Out of scope for this document
On-the-Fly Quantization
On-the-fly quantization is available for the BitsAndBytes, Optimum.Quanto, TorchAO, NNCF and Layerwise quantization methods and can be configured in Settings -> Quantization Settings
You can specify quantization for each model component:
- Model
  A shortcut that applies the selected quantization type to the entire model
- Transformer
  Applies quantization to the main DiT transformer, for example in SD3.5, HiDream, etc.
  Does not apply to the UNet used in SD15/SDXL, as UNet quantization is not recommended
- Text-encoder
  Applies to T5-like text encoders, for example in SD3.5, FLUX.1, HiDream, etc.
- VAE
  Quantization of the VAE module is typically not recommended
- LLM
  Used by models that rely on an LLM for secondary text encoding, as well as by VLM models during the captioning, interrogate and prompt-enhance features
- Video
  Specifically for video models; typically refers to the DiT module of the video model
You can mix and match quantization types for each model component
For example, you can use BitsAndBytes for the transformer and Optimum.Quanto for the LLM
Quantization Engines
SD.Next supports multiple quantization engines, each with multiple quantization schemes:
- BitsAndBytes: 3 float-based quantization schemes
- Optimum.Quanto: 3 int-based and 2 float-based quantization schemes
- TorchAO: 4 int-based and 3 float-based quantization schemes
- NNCF: 4 int-based quantization schemes
- Layerwise: 2 float8-based quantization schemes
- GGUF: with pre-quantized weights
> [!IMPORTANT]
> Not all quantization engines are available on all platforms, see notes below for details!
> Using any quantization engine for the first time may result in failure as required libraries are downloaded and installed
> Restart SD.Next and try again if you encounter any issues
> [!TIP]
> If you're on Windows with a compatible GPU, you may try WSL2 for broader feature compatibility
> See the WSL Wiki for more details
BitsAndBytes
Typical models pre-quantized with `bitsandbytes` have names like `*nf4.safetensors` or `*fp8.safetensors`
> [!NOTE]
> BnB allows the use of balanced offload as well as fast on-the-fly quantization during load, making it the most versatile choice, but it is not available on all platforms
Limitations:
- the default `bitsandbytes` package only supports nVidia GPUs
  some quantization types require a newer GPU with supported CUDA ops: e.g. nVidia Turing GPUs or newer
- `bitsandbytes` relies on `triton` packages which are not available on Windows unless manually compiled/installed
  without them, performance is significantly reduced
  - for nVidia: automatically installed as needed
  - for AMD/ROCm: link
  - for Intel/IPEX: link
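For reference, on-the-fly NF4 quantization of a transformer at the library level looks roughly like the sketch below. This is a hedged example using the diffusers API directly rather than the SD.Next UI; it assumes a recent diffusers with bitsandbytes support and a CUDA GPU, and the FLUX.1-dev repo id is used only as an example.

```python
import torch
from diffusers import FluxTransformer2DModel, BitsAndBytesConfig

# NF4 on-the-fly quantization: weights are quantized while the checkpoint is loaded
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",   # example repo; any DiT-style transformer works
    subfolder="transformer",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
```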
Optimum-Quanto
Typical models pre-quantized with `optimum.quanto` have names like `*qint.safetensors`
> [!NOTE]
> OQ is highly efficient with its qint8/qint4 quantization types, but it cannot be used with broad offloading methods
Limitations:
- requires torch>=2.4.0
  if you're running an older torch, you can try upgrading it or running SD.Next with the `--reinstall` flag
- not compatible with balanced offload
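Outside of the UI, the underlying optimum.quanto flow is the usual quantize/freeze pair. The following is a self-contained sketch on a toy module (the layer sizes are arbitrary), assuming the optimum-quanto package is installed.

```python
import torch
from optimum.quanto import quantize, freeze, qint8

# Toy stand-in for a text encoder / transformer block
model = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
)

quantize(model, weights=qint8)   # replace Linear weights with qint8 quantized tensors
freeze(model)                    # drop the original full-precision weights to reclaim memory

with torch.inference_mode():
    out = model(torch.randn(1, 768))
```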
TorchAO
TorchAO is available for on-the-fly quantization during model load as well as for post-load quantization
Limitations:
- Requires torch>=2.5.0
- int4-based quantization cannot be used with any offload method
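A minimal post-load sketch with the torchao API on a toy module is shown below; it assumes torchao is installed alongside torch>=2.5, and int8 weight-only is used here since, as noted above, int4 variants carry additional offload restrictions.

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# In-place weight-only int8 quantization of all Linear layers
quantize_(model, int8_weight_only())

with torch.inference_mode():
    out = model(torch.randn(1, 1024))
```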
NNCF
NNCF provides fully cross-platform storage-only quantization (referred to as model compression) on any platform with PyTorch, plus additional compute optimizations on the OpenVINO platform
> [!NOTE]
> The advantage of NNCF is that it works on any platform: if you're having issues with `optimum-quanto` or `bitsandbytes`, try it out!
- broad platform and GPU support
- enable in Settings -> Quantization Settings -> NNCF
- see NNCF Wiki for more details
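For reference, a minimal weight-compression sketch with NNCF on a toy module follows; it assumes the nncf package supports PyTorch weight compression in your installed version, and it does not show the extra INT8-SYM decompression/MatMul options that appear in the benchmark below.

```python
import torch
import nncf

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

# Data-free, storage-only weight compression (INT8 by default);
# weights are decompressed on-the-fly during the forward pass
compressed = nncf.compress_weights(model)

with torch.inference_mode():
    out = compressed(torch.randn(1, 1024))
```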
GGUF
GGUF is a binary file format used to package pre-quantized models.
GGUF was originally designed by the llama.cpp project and is intended to be used with its GGML execution runtime.
However, without GGML, GGUF provides storage-only quantization which means that every operation needs to be upcast to current device precision before execution (typically FP16 or BF16) which comes with a significant performance overhead.
> [!WARNING]
> Right now, all popular T2I inference UIs (SD.Next, Forge, ComfyUI, InvokeAI, etc.) use GGUF as storage-only, and as such the usage of GGUF is not recommended!
- `gguf` supports a wide range of quantization types and is not platform or GPU dependent
- `gguf` does not provide native GPU kernels, which means that `gguf` is purely a storage optimization
- `gguf` reduces model size and memory usage, but it does slow down model inference since all quantized weights are de-quantized on-the-fly
Limitations:
- `gguf` is not compatible with model offloading as it would trigger de-quantization
- the only component supported in the `gguf` binary format is the UNET/Transformer
  you cannot load an all-in-one single-file GGUF model
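For reference, loading a GGUF-quantized transformer at the library level looks roughly like the diffusers sketch below; it assumes a diffusers version with GGUF support, the file path is a hypothetical local pre-quantized FLUX transformer, and, matching the limitation above, only the transformer component is loaded this way.

```python
import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# GGUF is storage-only: compute_dtype is the precision weights are upcast to at runtime
transformer = FluxTransformer2DModel.from_single_file(
    "path/to/flux1-dev-Q4_K_S.gguf",  # hypothetical local GGUF file for illustration
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
```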
Benchmarks
Comparing performance of different quantization methods on the FLUX.1-Dev model
This is not a comprehensive benchmark, but rather a quick overview of the performance of different quantization methods
Environment:
- model==FLUX.1-Dev
- gpu==nVidia RTX4090
- torch==2.7.0
- dtype==auto
- attention==sdpa
- offload==Balanced
- sampler==FlowMatchEulerDiscreteScheduler
- resolution==1024x1024
- batch==1
| Engine | DType | It/s | Note |
|---|---|---|---|
| Torch | Float16 | 0.45 | |
| Torch | BFloat16 | 0.48 | Default |
| BnB | NF4 | 1.32 | |
| BnB | FP8 | 1.05 | |
| NNCF | INT8 | 1.07 | |
| NNCF | INT8-SYM | 0.85 | |
| NNCF | INT8-SYM | 1.17 | Decompress using torch.compile |
| NNCF | INT8-SYM | 1.23 | Use direct INT8 MatMul |
| NNCF | INT8-SYM | 1.46 | Decompress & MatMul |
| Quanto | any | N/A | Balanced offload not supported |
| TorchAO | INT8 | 0.51 | Weights-only |
| Layerwise | FP8 E4M | 0.99 | |
| SVDQuant | INT4 | 4.66 | Nunchaku pre-compiled |
Recommendations
CUDA
For CUDA environments, BitsAndBytes is the most versatile and recommended option, and `nf4` is the most efficient quantization type from both a memory and compute perspective
Other
For ROCm, ZLUDA or IPEX environments:
- NNCF is the best fully cross-platform option and brings good memory savings, but it does not have native compute optimizations
- Optimum.Quanto is the faster option if you can fit the model into VRAM, either using Model offload or no offload at all, since it is not compatible with the default Balanced offload
Errors
> [!CAUTION]
> Using incompatible configurations will result in errors during model load:
> - BitsAndBytes nf4 quantization is not compatible with sequential offload
>   Error: `Blockwise quantization only supports 16/32-bit floats`
> - Quanto qint quantization is not compatible with balanced offload
>   Error: `QBytesTensor.new() missing 5 required positional arguments`
> - Quanto qint quantization is not compatible with sequential offload
>   Error: `Expected all tensors to be on the same device`
Triton
Many quantization schemes rely on the Triton compiler for Torch, which is not available on all platforms
If your installation fails, you can try building `triton` from source or finding pre-built binary wheels
Triton for Windows
A Triton fork is available for Windows and can be installed by running the following PowerShell script from your SD.Next installation folder:
install-triton.ps1
```powershell
$ErrorActionPreference = "Stop"

# Locate the SD.Next virtual environment (override with the VENV_DIR environment variable)
$VENV_DIR = if ($env:VENV_DIR) { $env:VENV_DIR } else { Resolve-Path "venv" }
$PYTHON = "$VENV_DIR\Scripts\python"
$PIP = "$VENV_DIR\Scripts\pip"

# Determine the Python major/minor version to select the matching wheel
$sys_ver = & $PYTHON -VV
$sys_ver_major, $sys_ver_minor = $sys_ver.Split(" ")[1].Split(".")[0, 1]

# Download and install the matching triton-windows wheel, then clean up
$filename = "triton-3.2.0-cp$sys_ver_major$sys_ver_minor-cp$sys_ver_major$sys_ver_minor-win_amd64.whl"
$url = "https://github.com/woct0rdho/triton-windows/releases/download/v3.2.0-windows.post10/$filename"
Invoke-WebRequest $url -OutFile $filename
& $PIP install $filename
Remove-Item $filename
```