SDNQ Quantization

SD.Next Quantization provides full cross-platform quantization to reduce memory usage and increase performance for any device.

Usage

Go into Settings -> Quantization Settings
Enable the desired Quantization options under the SDNQ menu
Model, TE and LLM are the main targets for most use cases
If model is already loaded, reload the model
Once quantization options are set, they will be applied to any model loaded after that

Features

SDNQ is fully cross-platform, supports all GPUs and CPUs and includes many quantization methods:
Supports bit rates all the way from 1-bit to 16-bit
Supports integer, unsigned integer, floating point, unsigned floating point formats and their variants
Supports 33 int-based and 143 float-based quantization schemes, for a total of 176 supported quantization schemes
Supports nearly all model types
Supports compute optimizations using Triton via torch.compile
Supports Quantized MatMul with significant speedups on INT8, FP8 or FP16 supported GPUs
Supports on the fly quantization during model load with little to no overhead (called pre mode)
Supports quantization for the convolutional layers with UNet models
Supports post load quantization for any model
Supports on the fly usage of LoRA models
Supports SVD Quantization
Supports balanced offload

Benchmarks are available in the Quantization Wiki.

Recommended Options

Dequantize using torch.compile
Highly recommended for much better performance if Triton is available
Use Quantized MatMul
Recommended for much better performance if Triton is available on supported GPUs
Supported GPUs for quantized matmul are listed in the Use Quantized MatMul section.
Recommended quantization dtype is INT8 for its fast speed and almost no loss in quality
You can use INT6 with little quality loss to save more memory and UINT4 to save even more memory
float8_e4m3fn is another option for fast speed and high quality but FP8 has slightly lower quality and performance than INT8

Triton

Triton enables the use of optimized kernels for much better performance.
Triton is not required for SDNQ but it is highly recommended for much better performance.
SDNQ will use Triton by default via torch.compile if Triton is available. You can override this with the Dequantize using torch.compile option.

Triton with Nvidia

Linux
Triton comes built-in on Linux, so you can use the Triton optimizations out of the box.
Windows
Windows requires manual installation of Triton.
Installation steps are available in the Quantization Wiki

Triton with AMD

Linux
Triton comes built-in on Linux, so you can use the Triton optimizations out of the box.
Windows
Windows requires manual installation of Triton, and Triton is not guaranteed to work with ZLUDA.
Experimental installation steps are available in the ZLUDA Wiki

Triton with Intel

Triton comes built-in with Intel on both Windows and Linux, so you can use the Triton optimizations out of the box.
Windows might require additional installation of MSVC if it is not already installed and activated.
Installation steps are available in the PyTorch Inductor Windows wiki

Options

Quantization enabled

Used to decide which parts of the model will get quantized.
Recommended options are Model and TE.
Default is none.

Model is used to quantize the Diffusion Models.
TE is used to quantize the Text Encoders.
LLM is used to quantize the LLMs with Prompt Enhance.
Control is used to quantize ControlNets.
VAE is used to quantize the VAE. Using the VAE option is not recommended.

Note

VAE Upcast has to be set to false if you use the VAE option with FP16.
If you get black images with SDXL models, use the FP16 Fixed VAE.

Quantization mode

Used to decide when the quantization step will happen on model load.
Default is auto.

Auto mode will choose pre or post automatically depending on the model.
Pre mode will quantize the model while the model is loading. Reduces system RAM usage.
Post mode will quantize the model after the model is loaded into system RAM.

Pre mode is compatible with DiT and Video models like Flux but older UNet models like SDXL are only compatible with post mode.

Quantization type

Used to decide the data type used to store the model weights.
Recommended types are int8 for 8 bit, int6 for 6 bit and uint4 for 4 bit.
Default is int8.

INT8 quants have very similar quality to the full 16 bit precision while using 2 times less memory.
INT6 quants are the middle ground. Similar quality to the full 16 bit precision while using 2.7 times less memory.
UINT4 quants have lower quality and lower performance but use 3.6 times less memory.
FP8 quants have similar quality to INT6 but with the same memory usage as INT8.

Asymmetric quants have an extra u added to the start of their name, while symmetric quants don't have any prefix.

Asymmetric quants use unsigned integers, meaning they can't store negative values directly and use an additional variable called the zero point for this purpose.
Symmetric quants can store both negative and positive values, so they don't need the extra zero point value and run faster than asymmetric quants because of this.

Quality difference between asymmetric and symmetric quantization is very small for 8 to 6 bits but you should use asymmetric methods below 5 bits.
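
To make the difference between the two families concrete, here is a minimal pure-Python sketch of one common way to formulate symmetric and asymmetric quantization. It is not SDNQ's implementation: the function names, the per-tensor scale and the floating point zero point are all assumptions chosen for illustration.

```python
def quantize_symmetric(values, bits):
    """Signed quantization: a single scale, no zero point (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for 4 bits
    qmin = -(2 ** (bits - 1))            # e.g. -8 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q, scale):
    return [x * scale for x in q]


def quantize_asymmetric(values, bits):
    """Unsigned quantization: a scale plus a zero point that shifts the range."""
    qmax = 2 ** bits - 1                 # e.g. 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero_point = lo                      # assumption: stored as a float offset
    q = [min(qmax, max(0, round((v - zero_point) / scale))) for v in values]
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    return [x * scale + zero_point for x in q]


# A small, lopsided set of weights quantized to 4 bits both ways.
weights = [-0.10, 0.04, 0.25, 0.90]
q_sym, s_sym = quantize_symmetric(weights, 4)
q_asym, s_asym, zp = quantize_asymmetric(weights, 4)
```

In this example the asymmetric variant spends all 16 levels on the range the weights actually occupy, while the symmetric variant leaves most of its negative half unused. That is the kind of effect that matters more as the bit count drops.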

int16 uses int16 and has -32768 to 32767 range.
int15 uses sixteen int15 values packed into fifteen int16 values and has -16384 to 16383 range.
int14 uses eight int14 values packed into seven int16 values and has -8192 to 8191 range.
int13 uses sixteen int13 values packed into thirteen int16 values and has -4096 to 4095 range.
int12 uses four int12 values packed into three int16 values and has -2048 to 2047 range.
int11 uses sixteen int11 values packed into eleven int16 values and has -1024 to 1023 range.
int10 uses eight int10 values packed into five int16 values and has -512 to 511 range.
int9 uses sixteen int9 values packed into nine int16 values and has -256 to 255 range.
int8 uses int8 and has -128 to 127 range.
int7 uses eight int7 values packed into seven uint8 values and has -64 to 63 range.
int6 uses four int6 values packed into three uint8 values and has -32 to 31 range.
int5 uses eight int5 values packed into five uint8 values and has -16 to 15 range.
int4 uses two int4 values packed into a single uint8 value and has -8 to 7 range.
int3 uses eight int3 values packed into three uint8 values and has -4 to 3 range.
int2 uses four int2 values packed into a single uint8 value and has -2 to 1 range.
uint16 uses uint16 and has 0 to 65535 range.
uint15 uses sixteen uint15 values packed into fifteen int16 values and has 0 to 32767 range.
uint14 uses eight uint14 values packed into seven int16 values and has 0 to 16383 range.
uint13 uses sixteen uint13 values packed into thirteen int16 values and has 0 to 8191 range.
uint12 uses four uint12 values packed into three int16 values and has 0 to 4095 range.
uint11 uses sixteen uint11 values packed into eleven int16 values and has 0 to 2047 range.
uint10 uses eight uint10 values packed into five int16 values and has 0 to 1023 range.
uint9 uses sixteen uint9 values packed into nine int16 values and has 0 to 511 range.
uint8 uses uint8 and has 0 to 255 range.
uint7 uses eight uint7 values packed into seven uint8 values and has 0 to 127 range.
uint6 uses four uint6 values packed into three uint8 values and has 0 to 63 range.
uint5 uses eight uint5 values packed into five uint8 values and has 0 to 31 range.
uint4 uses two uint4 values packed into a single uint8 value and has 0 to 15 range.
uint3 uses eight uint3 values packed into three uint8 values and has 0 to 7 range.
uint2 uses four uint2 values packed into a single uint8 value and has 0 to 3 range.
uint1 uses eight uint1 values packed into a single uint8 value and has 0 to 1 range.
float16 uses float16 and has -65504 to 65504 range.
float15_e5m9fn uses float15_e5m9fn packed into uint15 and has -130944 to 130944 range.
float14_e5m8fn uses float14_e5m8fn packed into uint14 and has -130816 to 130816 range.
float13_e5m7fn uses float13_e5m7fn packed into uint13 and has -130560 to 130560 range.
float12_e5m6fn uses float12_e5m6fn packed into uint12 and has -130048 to 130048 range.
float11_e5m5fn uses float11_e5m5fn packed into uint11 and has -129024 to 129024 range.
float10_e5m4fn uses float10_e5m4fn packed into uint10 and has -126976 to 126976 range.
float9_e4m4fn uses float9_e4m4fn packed into uint9 and has -496 to 496 range.
float8_e4m3fn uses float8_e4m3fn and has -448 to 448 range.
float7_e3m3fn uses float7_e3m3fn packed into uint7 and has -30 to 30 range.
float6_e3m2fn uses float6_e3m2fn packed into uint6 and has -28 to 28 range.
float5_e2m2fn uses float5_e2m2fn packed into uint5 and has -7.0 to 7.0 range.
float4_e2m1fn uses float4_e2m1fn packed into uint4 and has -6.0 to 6.0 range.
float3_e1m1fn uses float3_e1m1fn packed into uint3 and has -3.0 to 3.0 range.
float2_e1m0fn uses float2_e1m0fn packed into uint2 and has -2.0 to 2.0 range.
float1_e1m0fnu uses float1_e1m0fnu packed into uint1 and has 0 to 2.0 range.
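
The packed integer layouts in the list above follow from simple bit arithmetic: four 6-bit values occupy 24 bits, which is exactly three bytes. The sketch below shows that idea for uint6 in pure Python, plus the range of an N-bit integer. The bit order inside the packed bytes and the function names are assumptions for illustration and may not match the layout SDNQ uses internally.

```python
def pack_uint6(values):
    """Pack groups of four 6-bit values (0..63) into three bytes each."""
    if len(values) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(values), 4):
        group = values[i:i + 4]
        if any(not 0 <= v <= 63 for v in group):
            raise ValueError("value out of uint6 range")
        a, b, c, d = group
        word = a | (b << 6) | (c << 12) | (d << 18)   # 4 * 6 = 24 bits
        out += word.to_bytes(3, "little")
    return bytes(out)


def unpack_uint6(data):
    """Inverse of pack_uint6: three bytes back into four 6-bit values."""
    values = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "little")
        values.extend((word >> shift) & 0x3F for shift in (0, 6, 12, 18))
    return values


def int_range(bits, signed):
    """Smallest and largest value an integer of the given width can hold."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


packed = pack_uint6([0, 63, 21, 42])
restored = unpack_uint6(packed)
```

The same reasoning gives the other layouts: eight 5-bit values need 40 bits (five bytes), and sixteen 15-bit values need 240 bits (fifteen 16-bit words).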

Quantization type for Text Encoders

Same as Quantization type but for the Text Encoders.
The default option will use the same type as Quantization type.

Quantized MatMul type

Overrides the Quantized MatMul type.
Default is auto which will use INT8 MatMul with INT and UINT types, FP8 MatMul with FP types below 16 bits and Quantized FP16 MatMul with FP types above 16 bits.
This option has no effect if the Use Quantized MatMul option is disabled.

Modules to not convert

A comma separated list of module names to exclude from quantization.
Modules listed in this option will not be quantized and will be kept in full precision.
An example list: transformer_blocks.0.img_mod.1.weight, transformer_blocks.0.*, img_in
Default is empty.
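
As a rough illustration of how such a list could be interpreted, the sketch below splits the comma separated string and checks parameter names against it. The matching rules are assumptions: shell-style wildcards via Python's `fnmatch`, and treating a bare module name such as img_in as covering everything underneath it. SDNQ's actual matching logic may differ, and `parse_module_list` and `is_skipped` are hypothetical helper names.

```python
from fnmatch import fnmatchcase


def parse_module_list(text):
    """Split a comma separated option string into clean pattern strings."""
    return [item.strip() for item in text.split(",") if item.strip()]


def is_skipped(name, patterns):
    """Return True if `name` matches a pattern or lives under a listed module."""
    for pattern in patterns:
        # Assumption: "*" is a shell-style wildcard, and a bare module name
        # also covers its children (img_in -> img_in.weight).
        if fnmatchcase(name, pattern) or fnmatchcase(name, pattern + ".*"):
            return True
    return False


patterns = parse_module_list(
    "transformer_blocks.0.img_mod.1.weight, transformer_blocks.0.*, img_in"
)
```

With these assumed rules, everything under transformer_blocks.0 and img_in stays in full precision, while transformer_blocks.10 is still quantized because the pattern requires the literal prefix `transformer_blocks.0.`.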

Modules dtype dict

A JSON dictionary of quantization types and module names list used to quantize the model with mixed quantization types.

Quantization types can be any valid quantization type supported by SDNQ or it can also be minimum_Xbit.
minimum_Xbit will quantize the specified modules to the specified bit width if the main quantization dtype has less precision.
For example, minimum_6bit will quantize the specified modules to int6 if you are using int5 or below but won't do anything if you are using int6 or above.
Default is empty.

An example dict:

{
"int8": ["transformer_blocks.0.img_mod.1.weight", "transformer_blocks.0.*"],
"minimum_6bit": ["img_in"]
}
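
The minimum_Xbit rule can be sketched as a small resolver that compares the bit width of the main quantization type with the requested minimum. This is only an illustration of the behaviour described above: `dtype_bits` and `resolve_rule` are hypothetical helpers, and returning a signed intX type for every upgraded module is an assumption taken from the int6 example in the text.

```python
import json
import re


def dtype_bits(dtype):
    """Extract the bit count from names like 'int5', 'uint4' or 'float8_e4m3fn'."""
    match = re.match(r"^(?:u?int|float)(\d+)", dtype)
    if match is None:
        raise ValueError(f"unrecognized dtype name: {dtype}")
    return int(match.group(1))


def resolve_rule(rule, main_dtype):
    """Return the dtype a module listed under `rule` would end up with."""
    match = re.fullmatch(r"minimum_(\d+)bit", rule)
    if match is None:
        return rule                      # explicit type such as "int8"
    minimum = int(match.group(1))
    if dtype_bits(main_dtype) >= minimum:
        return main_dtype                # already precise enough, no change
    return f"int{minimum}"               # assumption: upgrade to signed intX


config = json.loads("""
{
    "int8": ["transformer_blocks.0.img_mod.1.weight", "transformer_blocks.0.*"],
    "minimum_6bit": ["img_in"]
}
""")

# Resolve every rule in the example dict against a 4-bit main type.
resolved = {rule: resolve_rule(rule, "uint4") for rule in config}
```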

Group size

Used to decide how many elements of a tensor will share the same quantization group.
Higher values give better performance and lower memory usage at the cost of quality.
Default is 0, meaning it will decide the group size based on your quantization type setting.
Linear layers will use this formula to find the group size: 2 ** (2 + number_of_bits)
Convolutions will use this formula to find the group size: 2 ** (1 + number_of_bits)
Setting the group size to -1 will disable grouping.

Using Quantized MatMul will disable group sizes if the number of bits is 6 or more and group size is set to 0.
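
The default group size rules above can be written out as a short function. This is a sketch of the documented formulas only; the function name and its parameters are made up for illustration, and -1 is used as the "grouping disabled" marker as in the option itself.

```python
def effective_group_size(bits, group_size=0, is_conv=False, use_quantized_matmul=False):
    """Group size that results from the documented defaults."""
    if group_size != 0:
        return group_size                # explicit value (or -1) is used as-is
    if use_quantized_matmul and bits >= 6:
        return -1                        # grouping disabled with Quantized MatMul
    exponent = (1 if is_conv else 2) + bits
    return 2 ** exponent


int8_linear = effective_group_size(8)                  # 2 ** (2 + 8)
uint4_linear = effective_group_size(4)                 # 2 ** (2 + 4)
uint4_conv = effective_group_size(4, is_conv=True)     # 2 ** (1 + 4)
```

For example, an 8 bit linear layer defaults to groups of 1024 elements, while a 4 bit linear layer defaults to groups of 64, which is why lower bit widths carry relatively more scale metadata.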

SVD rank size

The rank size to use for SVD quantization.
Higher values give better quality at the cost of performance and memory usage.
Default is 32.

SVD steps

The number of steps to use in the lowrank SVD estimation.
Higher values give better quality but take longer to quantize.
Default is 8.

Dynamic loss threshold

The target threshold to use with Dynamic quantization.
SDNQ uses STD normalized MSE loss to calculate its quantization loss and this option will be used as the loss target for it.

Setting the loss threshold to a negative value or the value None will make SDNQ auto-select a threshold based on the weights dtype.
The formula used to calculate the threshold is: 10 ** -(num_bits / 2)

Some recommended target presets are:

16 bit: 1e-8
14 bit: 1e-7
12 bit: 1e-6
10 bit: 1e-5
8 bit: 1e-4
6 bit: 1e-3
4 bit: 1e-2
2 bit: 1e-1

These targets are not set in stone and might require some trial and error depending on the model.
Default is None
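
The auto-selected threshold and the presets above come from the same formula, which can be checked with a few lines of Python. The helper names are hypothetical. The fallback for None or negative values mirrors the behaviour described in this section.

```python
def auto_loss_threshold(num_bits):
    """Documented formula: 10 ** -(num_bits / 2)."""
    return 10 ** -(num_bits / 2)


def resolve_threshold(threshold, num_bits):
    """Use the given threshold, or fall back to the formula for None / negative."""
    if threshold is None or threshold < 0:
        return auto_loss_threshold(num_bits)
    return threshold


# Reproduce the preset table for the even bit widths.
presets = {bits: auto_loss_threshold(bits) for bits in (16, 14, 12, 10, 8, 6, 4, 2)}
```

Odd bit widths fall between the presets, for instance 5 bits gives roughly 3.2e-3.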

Use SVD quantization

Enabling this option will apply SVD quantization on top of SDNQ quantization.
SVD has much higher quality but runs slower.
SVD also makes LoRAs usable with 4 bit quantization.
More info on SVD quantization: https://arxiv.org/abs/2411.05007
Disabled by default.

Note: The lowrank SVD used by SDNQ is not deterministic, meaning that you will get slightly different quantization results every time.

Use Dynamic quantization

Enabling this option will dynamically select a per-layer quantization type based on the Dynamic loss threshold.
The current Quantization type will be used as the minimum allowed quantization type when this option is enabled.
This option takes longer to quantize but has much higher quality depending on your settings.

A simplified version of what this option does:
If the minimum allowed quantization type can't achieve a good quantization loss on a specific layer, SDNQ will step up to a higher quantization type just for that layer until it achieves a good quantization loss.
Testing the candidate types takes extra time during quantization and the final model might be slightly larger, but the resulting model will have much higher quality.

Try this option before SVD.
If you can't get good enough results with Dynamic quantization alone, then you can combine it with SVD too.
Disabled by default.
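
The per-layer escalation described in this section can be sketched as a loop that tries increasing bit widths until the loss drops under the threshold. Everything here is an assumption made for illustration: the simple symmetric round-trip quantizer, the exact definition of the STD normalized MSE loss, the 16 bit cap and the helper names. It is not SDNQ's actual selection code.

```python
import statistics


def fake_quantize(values, bits):
    """Symmetric round trip: quantize to signed `bits` integers and back."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [min(qmax, max(-qmax - 1, round(v / scale))) * scale for v in values]


def normalized_mse(values, restored):
    """MSE of the error after dividing by the std of the weights (assumed form)."""
    std = statistics.pstdev(values) or 1.0
    return sum(((a - b) / std) ** 2 for a, b in zip(values, restored)) / len(values)


def pick_bits(values, minimum_bits, threshold, maximum_bits=16):
    """Smallest bit width >= minimum_bits whose loss is within the threshold."""
    for bits in range(minimum_bits, maximum_bits + 1):
        if normalized_mse(values, fake_quantize(values, bits)) <= threshold:
            return bits
    return maximum_bits


# One layer that 4 bits represents exactly, one with an outlier plus fine detail.
easy_layer = [-7.0, -3.0, 0.0, 2.0, 7.0]
hard_layer = [0.011, -0.023, 0.037, -0.041, 1.0]
easy_bits = pick_bits(easy_layer, minimum_bits=4, threshold=1e-4)
hard_bits = pick_bits(hard_layer, minimum_bits=4, threshold=1e-4)
```

The easy layer stays at the 4 bit minimum, while the layer with the outlier is bumped to a wider type because its small weights all collapse to zero at 4 bits. That mirrors the trade-off described above: longer quantization time and a slightly larger model in exchange for lower loss on the difficult layers.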

Quantize convolutional layers

Enabling this option will quantize convolutional layers in UNet models too.
Has much better memory savings but lower quality.

Dequantize using torch.compile

Uses Triton via torch.compile on the dequantization step.
Has significantly higher performance.
This setting requires a full restart of the webui to apply.
Enabled by default if Triton is available.

Use Quantized MatMul

Enabling this option will use quantized INT8 or FP8 MatMul instead of BF16 / FP16 when running the model.
Has significantly higher performance on GPUs with INT8 or FP8 support.
Requires Triton. Disabled by default.

Supported GPUs
- Nvidia
  Requires Turing (RTX 2000) or newer GPUs for INT8 matmul.
  Requires Ada (RTX 4000) or newer GPUs for FP8 matmul.
- AMD
  Requires RDNA2 (RX 6000) or newer GPUs for INT8 matmul.
  Requires MI300X or RDNA4 (RX 9000) for FP8 matmul.
  - RDNA3 (RX 7000) supports INT8 matmul but runs at the same speed as FP16.
  - RDNA2 (RX 6000) and older GPUs are supported via Triton.
- Intel
  Requires Alchemist (Arc A) or newer GPUs for INT8 matmul.
  Intel doesn't support FP8 matmul.
  Intel can emulate FP8 matmul but it will be very slow.

Recommended quant type to use with this option is int8 for quality and better hardware compatibility.
INT8 matmul is also significantly faster than FP8 matmul on consumer Nvidia GPUs.
Recommended quant type for FP8 matmul is float8_e4m3fn.

Use Quantized MatMul with convolutional layers

Same as Use Quantized MatMul but for the convolutional layers with UNets like SDXL.
Disabled by default.

Quantize using GPU

Enabling this option will use the GPU for quantization calculations on model load.
Can be faster with weak CPUs but can also be slower because of the GPU to CPU communication overhead.
Enabled by default.

When Model load device map in the Models & Loading settings is set to default or cpu, this option will send a part of the model weights to the GPU, quantize it, and then send it back to the CPU right away.
If device map is set to gpu, model weights will be loaded directly into GPU and the quantized model weights will be kept in the GPU until the quantization of the current model part is over.

If Model offload mode is set to none, quantized model weights will be sent to the GPU after quantization and will stay in the GPU.
If Model offload mode is set to model, quantized model weights will be sent to the GPU after quantization and will be sent back to the CPU after the quantization of the current model part is over.

Dequantize using full precision

Enabling this option will use FP32 on the dequantization step and will keep the scales, zero points and SVD in FP32.
Has higher quality outputs but also has higher memory usage.
Enabled by default.