SDNQ Quantization
SD.Next Quantization provides full cross-platform quantization to reduce memory usage and increase performance for any device.
SDNQ was originally based on NNCF, but has been re-implemented, optimized and evolved enough to become its own quantization method!
Usage
- Go into Quantization Settings
- Enable the desired Quantization options under the SDNQ menu (Model, Transformer, TE and LLM are the main targets for most use cases)
- Reload the model
Features
- SDNQ is fully cross-platform, supports all GPUs and includes many quantization methods:
  - 8-bit, 6-bit, 4-bit, 2-bit and 1-bit int and uint
  - 8-bit e5, e4 and fnuz float
  Note: int8 is very close to the original 16 bit quality
- Supports nearly all model types
- Supports compute optimizations using Triton via torch.compile
- Supports Quantized MatMul with significant speedups on INT8 or FP8 supported GPUs
- Supports on the fly quantization during model load with DiT models (called pre mode)
- Supports quantization for the convolutional layers with UNet models
- Supports post load quantization for any model
- Supports on the fly usage of LoRA models
- Supports balanced offload
Options
Quantization enabled
Used to decide which parts of the model will get quantized.
Recommended options are Model and TE with post mode, or Transformer, TE and LLM with pre mode.
Default is none.
Model is used to quantize the UNet in post mode or every model part in pre mode.
Transformer is used to quantize DiT models.
VAE is used to quantize the VAE. Using the VAE option is not recommended.
TE is used to quantize the Text Encoders.
Video is used to quantize Video models.
LLM is used to quantize the LLM part of models that use LLMs as Text Encoders.
ControlNet is used to quantize ControlNets.
Note:
VAE Upcast has to be set to false if you use the VAE option with FP16.
If you get black images with SDXL models, use the FP16 Fixed VAE.
Quantization mode
Used to decide when the quantization step will happen on model load.
Pre mode will quantize the model while the model is loading. Reduces system RAM usage.
Post mode will quantize the model after the model is loaded into system RAM.
Pre mode is compatible with DiT and Video models like Flux, but older UNet models like SDXL are only compatible with post mode.
Default is pre.
Quantization type
Used to decide the data type used to store the model weights.
Recommended types are int8 for 8 bit, int6 for 6 bit, float8_e4m3fn for fp8 and uint4 for 4 bit.
Default is int8.
INT8 quants have very similar quality to the full 16 bit precision while using 2 times less memory.
INT6 quants are the middle ground, with similar quality to the full 16 bit precision while using 2.7 times less memory.
INT4 quants have lower quality and lower performance but use 3.6 times less memory.
FP8 quants have similar quality to INT6 but with the same memory usage as INT8.
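The ratios above follow roughly from the number of bits stored per weight. Here is a minimal sketch of that arithmetic. It assumes (this is not stated above) that grouped types store one 16-bit scale per group, plus one 16-bit zero point per group for unsigned types. `effective_bits` and `savings` are hypothetical helper names, and the group size of 64 is the default 4-bit value given by the Group size formula.

```python
def effective_bits(bits, group_size=None, has_zero_point=False, meta_bits=16):
    """Approximate storage cost per weight, in bits.

    Assumption: every group stores one scale (and one zero point for
    unsigned types), each taking `meta_bits` bits.
    """
    if group_size is None:
        # The small overhead of ungrouped scales is ignored here.
        return float(bits)
    per_group_meta = meta_bits * (2 if has_zero_point else 1)
    return bits + per_group_meta / group_size


def savings(bits, **kwargs):
    """How many times smaller the weights are than 16-bit weights."""
    return 16 / effective_bits(bits, **kwargs)


print(round(savings(8), 2))                                      # 2.0
print(round(savings(6), 2))                                      # 2.67
print(round(savings(4, group_size=64, has_zero_point=True), 2))  # 3.56
```

Under these assumptions, the per-group metadata is why a 4-bit type lands near 3.6 times rather than a full 4 times.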
Unsigned quants have an extra u added to the start of their name, while symmetric quants don't have any prefix.
Unsigned (asymmetric) types: uint8, uint6, uint4, uint2 and uint1
Symmetric types: int8, int6, int4, int2, float8_e4m3fn, float8_e5m2, float8_e4m3fnuz and float8_e5m2fnuz
Unsigned quants use unsigned integers, meaning they can't store negative values and use another variable called the zero point for this purpose.
Symmetric quants can store both negative and positive values, meaning they don't need an extra zero point value and run faster than unsigned quants because of this.
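The difference between the two families can be sketched in plain Python. This is an illustrative sketch, not SDNQ's actual code: the function names are hypothetical, the scale is computed over a whole list instead of per channel or per group, and storing the zero point as the minimum float value is only one possible convention.

```python
def quantize_symmetric(values, bits=8):
    """Signed quantization: one scale, no zero point."""
    qmax = 2 ** (bits - 1) - 1           # 127 for int8
    qmin = -(2 ** (bits - 1))            # -128 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q, scale):
    return [x * scale for x in q]


def quantize_asymmetric(values, bits=8):
    """Unsigned quantization: a scale plus a zero point that shifts the range."""
    qmax = 2 ** bits - 1                 # 255 for uint8
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax or 1.0
    zero_point = lo                      # stored value 0 maps back to `lo`
    q = [min(qmax, max(0, round((v - zero_point) / scale))) for v in values]
    return q, scale, zero_point


def dequantize_asymmetric(q, scale, zero_point):
    # The extra addition is the work symmetric types avoid.
    return [x * scale + zero_point for x in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.27]
print(quantize_symmetric(weights)[0])
print(quantize_asymmetric(weights)[0])
```

The asymmetric path needs one more stored value and one more operation on decompression, which is consistent with symmetric types being the faster option.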
int8 uses int8 and has -128 to 127 range.
uint8 uses uint8 and has 0 to 255 range.
int6 uses four int6 values packed into three uint8 values and has -32 to 31 range.
uint6 uses four uint6 values packed into three uint8 values and has 0 to 63 range.
int4 uses two int4 values packed into a single uint8 value and has -8 to 7 range.
uint4 uses two uint4 values packed into a single uint8 value and has 0 to 15 range.
int2 uses four int2 values packed into a single uint8 value and has -2 to 1 range.
uint2 uses four uint2 values packed into a single uint8 value and has 0 to 3 range.
uint1 uses boolean and has 0 to 1 range.
float8_e4m3fn uses float8_e4m3fn and has -448 to 448 range.
float8_e5m2 uses float8_e5m2 and has -57344 to 57344 range.
float8_e4m3fnuz uses float8_e4m3fnuz and has -240 to 240 range.
float8_e5m2fnuz uses float8_e5m2fnuz and has -57344 to 57344 range.
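The sub-byte packing described above can be illustrated with plain Python bit operations. This is a sketch under assumptions: the function names are hypothetical, and the bit order (first value in the lowest bits, little-endian bytes) is chosen for illustration and may differ from the layout SDNQ actually uses. Only unsigned values are shown. Signed types would additionally need an offset or a two's complement conversion.

```python
def pack_uint4(values):
    """Pack pairs of 4-bit values (0..15) into single bytes."""
    assert len(values) % 2 == 0
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_uint4(packed):
    out = []
    for byte in packed:
        out += [byte & 0x0F, byte >> 4]
    return out


def pack_uint6(values):
    """Pack groups of four 6-bit values (0..63) into three bytes."""
    assert len(values) % 4 == 0
    out = bytearray()
    for i in range(0, len(values), 4):
        a, b, c, d = values[i:i + 4]
        word = a | (b << 6) | (c << 12) | (d << 18)   # 4 * 6 = 24 bits
        out += word.to_bytes(3, "little")
    return bytes(out)


def unpack_uint6(packed):
    out = []
    for i in range(0, len(packed), 3):
        word = int.from_bytes(packed[i:i + 3], "little")
        out += [(word >> shift) & 0x3F for shift in (0, 6, 12, 18)]
    return out


print(len(pack_uint4([0, 15, 7, 8])))    # 2 bytes for four 4-bit values
print(len(pack_uint6([0, 63, 31, 32])))  # 3 bytes for four 6-bit values
```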
Group size
Used to decide how many elements of a tensor will share the same quantization group.
Higher values have better performance but less quality.
Default is 0, meaning it will decide the group size based on your quantization type setting.
INT8, INT6 and FP8 quants won't use any grouping by default.
Other quant types will use this formula to find the group size: 2 ** (2 + number_of_bits)
Setting the group size to -1 will disable grouping.
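The group size rules above can be written out as a small function. This is a sketch, not SDNQ's implementation: `resolve_group_size` is a hypothetical name, `None` is used here to mean "no grouping", and treating every 6-bit and 8-bit type (including unsigned ones) as ungrouped by default is an assumption.

```python
def resolve_group_size(setting, bits):
    """Resolve the Group size setting for a linear layer.

    setting: the Group size option (0 = automatic, -1 = disabled).
    bits: number of bits of the selected quantization type.
    """
    if setting == -1:
        return None                # grouping explicitly disabled
    if setting > 0:
        return setting             # user supplied value
    if bits >= 6:
        return None                # INT8, INT6 and FP8: no grouping by default
    return 2 ** (2 + bits)         # formula for the remaining types


for bits in (8, 6, 4, 2, 1):
    print(bits, resolve_group_size(0, bits))
```

With the automatic setting this gives 64 for 4-bit types, 16 for 2-bit types and 8 for 1-bit types.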
Quantize the convolutional layers
Enabling this option will quantize the convolutional layers in UNet models too.
Has much better memory savings but lower quality.
Convolutions won't use group sizes with INT8 and FP8 quants.
Convolutions will use this formula to find the group size with other quant types: 2 ** (1 + number_of_bits)
Convolutions will fall back to uint4 when the quant type is uint2, int2 or uint1.
Disabled by default.
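The convolution rules above differ slightly from the ones for linear layers, and can be sketched as follows. `conv_quant_plan` and the `BITS` table are hypothetical names used for illustration. Treating uint8 the same as INT8 is an assumption, and `None` again stands for "no grouping".

```python
# Bit width of each quantization type named on this page.
BITS = {
    "int8": 8, "uint8": 8, "int6": 6, "uint6": 6,
    "int4": 4, "uint4": 4, "int2": 2, "uint2": 2, "uint1": 1,
    "float8_e4m3fn": 8, "float8_e5m2": 8,
    "float8_e4m3fnuz": 8, "float8_e5m2fnuz": 8,
}


def conv_quant_plan(quant_type):
    """Return (effective_type, group_size) for a convolutional layer."""
    if quant_type in ("uint2", "int2", "uint1"):
        quant_type = "uint4"       # very low bit types fall back to uint4
    bits = BITS[quant_type]
    if bits == 8:
        return quant_type, None    # INT8 and FP8: no group size
    return quant_type, 2 ** (1 + bits)


print(conv_quant_plan("int6"))     # ('int6', 128)
print(conv_quant_plan("uint2"))    # ('uint4', 32)
```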
Decompress using torch.compile
Uses Triton via torch.compile on the decompression step.
Has significantly higher performance.
This setting requires a full restart of the webui to apply.
Enabled by default if Triton is available.
Use Quantized MatMul
Enabling this option will use quantized INT8 or FP8 MatMul instead of BF16 / FP16.
Has significantly higher performance on GPUs with INT8 or FP8 support.
Recommended quant type to use with this option is int8 because INT8 matmul tends to be faster than FP8.
Quantized INT8 MatMul is only compatible with int8, int4 and int2 quant types.
Quantized FP8 MatMul is only compatible with float8_e4m3fn and float8_e5m2 quant types.
Group sizes will be disabled when Quantized MatMul is enabled.
Disabled by default.
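The general idea behind an INT8 quantized matmul can be sketched in plain Python: both operands are quantized to integers, the dot products are accumulated as integers, and the result is rescaled to float at the end. This is a conceptual sketch only. The function names are hypothetical, and the per-row scales and on the fly activation quantization are assumptions rather than a description of SDNQ's kernels, which run on the GPU.

```python
def quantize_rows_int8(matrix):
    """Symmetric int8 quantization with one scale per row (no groups)."""
    q_rows, scales = [], []
    for row in matrix:
        scale = max(abs(v) for v in row) / 127 or 1.0
        q_rows.append([max(-128, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def int8_matmul(x, w):
    """Approximate x @ w.T using integer dot products.

    x: activations with shape (m, k); w: weights with shape (n, k).
    """
    qx, sx = quantize_rows_int8(x)
    qw, sw = quantize_rows_int8(w)
    out = []
    for i, x_row in enumerate(qx):
        out_row = []
        for j, w_row in enumerate(qw):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulation
            out_row.append(acc * sx[i] * sw[j])             # rescale to float
        out.append(out_row)
    return out


x = [[1.0, -2.0, 0.5]]
w = [[0.25, 0.5, -1.0], [1.0, 1.0, 1.0]]
print(int8_matmul(x, w))  # close to the exact result [[-1.25, -0.5]]
```

Because each row carries a single scale, the rescaling happens once per output element, which fits with group sizes being disabled when this option is enabled.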
Use Quantized MatMul with convolutional layers
Same as Use Quantized MatMul but for the convolutional layers with UNets like SDXL.
Disabled by default.
Quantize with the GPU
Enabling this option will use the GPU for the quantization calculations on model load.
Can be faster with weak CPUs but can also be slower because of GPU to CPU communication overhead.
Enabled by default.
When Model load device map in the Models & Loading settings is set to default or cpu, this option will send a part of the model weights to the GPU, quantize it, then send it back to the CPU right away.
If device map is set to gpu, model weights will be loaded directly into the GPU and the quantized model weights will be kept in the GPU until the quantization of the current model part is over.
If Model offload mode is set to none, quantized model weights will be sent to the GPU regardless of this setting and will stay in the GPU.
If Model offload mode is set to model, quantized model weights will be sent to the GPU regardless of this setting and will be sent back to the CPU after the quantization of the current model part is over.
Decompress using full precision
Enabling this option will use FP32 on the decompression step.
Has higher quality outputs but lower performance.
Disabled by default.