# NNCF Quantization
NNCF provides full cross-platform quantization to reduce memory usage and increase performance for any device.
## Usage
- Go into Quantization Settings
- Enable the desired Quantization options under the NNCF menu (Model, Transformer, TE and LLM are the main targets for most use cases)
- Reload the model

Note: VAE Upcast has to be set to false if you use the VAE option.
If you get black images with SDXL models, use the FP16 Fixed VAE.
## Features
- Supports `INT8`, `INT8_SYM`, `INT4` and `INT4_SYM` quantization schemes; `INT8_SYM` is very close to the original 16 bit quality
- Supports compute optimizations using Triton via `torch.compile`
- Supports direct INT8 MatMul with significant speedups on GPUs with INT8 support
- Supports on the fly quantization during model load with DiT models (called `pre` mode)
- Supports quantization for the convolutional layers with UNet models
- Supports post load quantization for any model (see the sketch after this list)
- Supports on the fly usage of LoRA models
- Supports balanced offload
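
As an illustration of post load quantization, here is a minimal sketch using NNCF's public `compress_weights` API on a diffusers pipeline. SD.Next wires NNCF into its own loader (with group sizes, direct INT8 MatMul, offload, etc.), so treat this as a conceptual example rather than the actual code path; the model ID and module names are assumptions.

```python
import nncf
import torch
from diffusers import StableDiffusionXLPipeline

# "Post" mode idea: load the full 16 bit model into system RAM first,
# then compress its weights.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# Compress the UNet and Text Encoders to INT8_SYM with NNCF's public API.
# This is not SD.Next's exact integration, only the general flow.
for name in ("unet", "text_encoder", "text_encoder_2"):
    module = getattr(pipe, name)
    setattr(pipe, name, nncf.compress_weights(module, mode=nncf.CompressWeightsMode.INT8_SYM))

pipe.to("cuda")  # move the compressed pipeline to the GPU as usual
```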
## Disadvantages
- Fused projections are not compatible with NNCF
## Options
### Quantization enabled
Used to decide which parts of the model will get quantized.
Recommended options are `Model` and `TE` with post mode, or `Transformer`, `TE` and `LLM` with pre mode.
Default is none.
- `Model` is used to quantize the UNet in post mode or every model part in pre mode.
- `Transformer` is used to quantize the DiT models.
- `VAE` is used to quantize the VAE.
- `TE` is used to quantize the Text Encoders.
- `Video` is used to quantize the Video models.
- `LLM` is used to quantize the LLM part of the models that use LLMs as Text Encoders.
- `ControlNet` is used to quantize ControlNets.
### Quantization mode
Used to decide when the quantization step will happen on model load.
`Pre` mode will quantize the model while it is loading, which reduces system RAM usage.
`Post` mode will quantize the model after it is loaded into system RAM.
`Pre` mode is compatible with DiT and Video models like Flux, but older UNet models like SDXL are only compatible with `post` mode.
Default is `post`.
### Quantization type
Used to decide the data type used to store the model weights.
Recommended types are `INT8_SYM` for 8 bit and `INT4` for 4 bit.
Default is `INT8_SYM`.

INT8 quants have very similar quality to the full 16 bit precision while using 2 times less memory.
INT4 quants have lower quality and less performance but use 4 times less memory.

SYM quants have the extra `_SYM` suffix added to their name, while ASYM quants don't have any suffix.
ASYM types: `INT8` and `INT4`
SYM types: `INT8_SYM` and `INT4_SYM`

ASYM quants use unsigned integers, meaning they can't store negative values and instead use another variable called the zero point for this purpose.
SYM quants can store both negative and positive values, so they don't need an extra zero point value and run faster than ASYM quants because of this.

- `INT8` uses uint8 and has a 0 to 255 range.
- `INT8_SYM` uses int8 and has a -128 to 127 range.
- `INT4` uses two uint4 values packed into a single uint8 value and has a 0 to 15 range.
- `INT4_SYM` uses two int4 values packed into a single uint8 value and has a -8 to 7 range.
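
To make the SYM / ASYM difference concrete, here is a minimal PyTorch sketch of per-tensor INT8 quantization in both forms. It is illustrative only; the actual NNCF kernels quantize per channel or per group and also handle INT4 packing.

```python
import torch

def quantize_int8_sym(w: torch.Tensor):
    # Symmetric: signed int8 (-128..127), a scale and no zero point.
    scale = w.abs().amax() / 127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def quantize_int8_asym(w: torch.Tensor):
    # Asymmetric: unsigned uint8 (0..255); a zero point shifts negative
    # values into the unsigned range.
    w_min, w_max = w.amin(), w.amax()
    scale = (w_max - w_min) / 255
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero_point, 0, 255).to(torch.uint8)
    return q, scale, zero_point

w = torch.randn(4096, 4096)

q_sym, s = quantize_int8_sym(w)
w_sym = q_sym.float() * s                # decompression: one multiply

q_asym, s, zp = quantize_int8_asym(w)
w_asym = (q_asym.float() - zp) * s       # decompression: subtract zero point first

print((w - w_sym).abs().max(), (w - w_asym).abs().max())
```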
### Group size
Used to decide how many elements of a tensor will share the same quantization group.
Higher values have better performance but less quality.
Default is `0`, meaning the group size will be decided based on your quantization type setting:
INT4 quants will use group size `64` and INT8 quants won't use any grouping.
Setting the group size to `-1` will disable grouping.
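
A rough sketch of what group-wise quantization looks like, assuming symmetric INT8 and a group size of 64 for simplicity (the real implementation applies this to packed INT4 as well): every group gets its own scale, so a single outlier only degrades its own group.

```python
import torch

def quantize_int8_sym_grouped(w: torch.Tensor, group_size: int = 64):
    # Split each row into groups of `group_size` elements and compute one
    # scale per group instead of one scale per tensor or per row.
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True) / 127
    q = torch.clamp(torch.round(groups / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return (q.float() * scale).reshape(q.shape[0], -1)

w = torch.randn(1024, 4096)
q, scale = quantize_int8_sym_grouped(w, group_size=64)
print(q.shape, scale.shape)   # (1024, 64, 64) and (1024, 64, 1): one scale per group
print((w - dequantize(q, scale)).abs().max())
```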
### Quantize the convolutional layers
Enabling this option will quantize the convolutional layers in UNet models too.
Has better memory savings but lower quality.
Quantizing the VAE is not recommended with this option.
Group sizes are not supported on convolutions.
Disabled by default.
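
For the convolutional case, a simple per-output-channel scheme (assumed here purely for illustration, not necessarily SD.Next's exact kernel) shows why group sizes don't apply: each output channel shares one scale over all of its `in_ch * kH * kW` weights.

```python
import torch

def quantize_conv_int8_sym(w: torch.Tensor):
    # Conv weights are shaped (out_ch, in_ch, kH, kW); use one scale per
    # output channel over all of its in_ch * kH * kW elements.
    out_ch = w.shape[0]
    flat = w.reshape(out_ch, -1)
    scale = flat.abs().amax(dim=1, keepdim=True) / 127
    q = torch.clamp(torch.round(flat / scale), -128, 127).to(torch.int8)
    return q.reshape_as(w), scale.reshape(out_ch, 1, 1, 1)

conv = torch.nn.Conv2d(320, 320, kernel_size=3, padding=1)
q, scale = quantize_conv_int8_sym(conv.weight.detach())
w_restored = q.float() * scale   # broadcast the per-channel scale back
print((conv.weight - w_restored).abs().max())
```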
### Decompress using full precision
Enabling this option will use `FP32` for the decompression step.
Has higher quality outputs but lower performance.
Disabled by default.
### Decompress using `torch.compile`
Uses Triton via `torch.compile` for the decompression step.
Has significantly higher performance.
Enabled by default if Triton is available.
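
A minimal sketch of the idea, assuming a CUDA GPU with a working Triton install; the helper name and the exact fusion are assumptions, not SD.Next internals.

```python
import torch

# Decompress an INT8_SYM weight back to 16 bit right before use.
# torch.compile lets Triton fuse the dtype cast and the scale multiply
# into a single GPU kernel.
@torch.compile
def decompress_int8_sym(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale.to(torch.float16)

q = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8, device="cuda")
scale = torch.rand(4096, 1, device="cuda") * 0.01
w = decompress_int8_sym(q, scale)  # first call compiles, later calls reuse the kernel
```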
### Use direct INT8 MatMul
Enabling this option will use direct INT8 MatMul instead of BF16 / FP16.
Has significantly higher performance on GPUs with INT8 support but has lower quality.
Direct INT8 MatMul is only compatible with SYM quants.
Group sizes will be disabled when direct INT8 MatMul is enabled.
Convolutions won't use direct INT8 MatMul.
Disabled by default.
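
The sketch below emulates the idea with a plain integer matmul so it runs anywhere: both operands stay in INT8 (SYM, so no zero points), accumulation happens in INT32, and both scales are applied once at the end. Real kernels run a dedicated INT8 GEMM on the GPU instead of this emulation.

```python
import torch

def int8_matmul(x_q: torch.Tensor, w_q: torch.Tensor, x_scale, w_scale):
    # Integer multiply-accumulate, then a single dequantize of the
    # int32 accumulator. On INT8-capable GPUs a dedicated INT8 GEMM
    # (e.g. torch._int_mm on CUDA) does this step in hardware.
    acc = x_q.to(torch.int32) @ w_q.to(torch.int32)
    return acc.to(torch.float32) * (x_scale * w_scale)

# Symmetric, per-tensor scales for simplicity (SYM only: no zero points).
x = torch.randn(8, 4096)
w = torch.randn(4096, 4096)
x_scale = x.abs().amax() / 127
w_scale = w.abs().amax() / 127
x_q = torch.clamp(torch.round(x / x_scale), -128, 127).to(torch.int8)
w_q = torch.clamp(torch.round(w / w_scale), -128, 127).to(torch.int8)

y = int8_matmul(x_q, w_q, x_scale, w_scale)
print((y - x @ w).abs().mean())  # quantization error vs the float matmul
```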
## Memory usage results
These results compare NNCF INT8 to 16 bit.
For performance results, please check out the benchmarks on the Quantization Wiki.
- Model:
  Compresses the UNet or Transformer part of the model.
  This is where most of the memory savings happen for Stable Diffusion.
  SDXL: ~2500 MB memory savings.
  SD 1.5: ~750 MB memory savings.
  PixArt-XL-2: ~600 MB memory savings.
- Text Encoder:
  Compresses the Text Encoder parts of the model.
  This is where most of the memory savings happen for PixArt.
  PixArt-XL-2: ~4750 MB memory savings.
  SDXL: ~750 MB memory savings.
  SD 1.5: ~120 MB memory savings.
- VAE:
  Compresses the VAE part of the model.
  Memory savings from compressing the VAE are pretty small.
  SD 1.5 / SDXL / PixArt-XL-2: ~75 MB memory savings.