NNCF Model Compression

Usage

  1. Go into Quantization Settings
  2. Enable the desired Quantization options under the NNCF menu
    (Model, Transformer, TE and LLM are the main targets for most use cases)
  3. Reload the model

Note: VAE Upcast must be set to false when using the VAE option.
If you get black images with SDXL models, use the FP16 Fixed VAE.
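For reference, outside of the UI the FP16 Fixed VAE fix amounts to swapping in the madebyollin/sdxl-vae-fp16-fix VAE. A minimal diffusers sketch (the model IDs are illustrative, not part of this UI):

```python
# Sketch of the "FP16 Fixed VAE" fix for black images with SDXL.
# madebyollin/sdxl-vae-fp16-fix is the commonly used FP16-safe SDXL VAE;
# in the UI this is a setting, the code below is only an equivalent illustration.
import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    vae=vae,
    torch_dtype=torch.float16,
)
```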

Features

  • Supports the INT8, INT8_SYM, INT4 and INT4_SYM quantization schemes (see the sketch after this list)
  • INT8_SYM quality is very close to the original 16-bit model
  • Supports on-the-fly quantization during model load for DiT models (called pre mode)
  • Supports quantization of the convolutional layers in UNet models
  • Supports post-load quantization for any model
  • Supports on-the-fly usage of LoRA models
  • Supports balanced offload
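As a rough illustration of the post-load path, NNCF exposes these schemes through its compress_weights API, which can be applied to an already-loaded torch module. A minimal sketch, assuming a diffusers pipeline (the pipeline ID is just an example):

```python
# Sketch of post-load weight compression via nncf.compress_weights;
# any torch.nn.Module works the same way.
import nncf
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# The schemes listed above map onto NNCF's CompressWeightsMode enum;
# INT8_SYM stays closest to the original 16-bit quality.
pipe.unet = nncf.compress_weights(
    pipe.unet, mode=nncf.CompressWeightsMode.INT8_SYM
)
```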

Disadvantages

  • Runs via autocast: the GPU still executes the model in 16-bit, so inference is slower
  • Fused projections are not compatible with NNCF

Options

These results compare NNCF INT8 against 16-bit.

  • Model:
    Compresses the UNet or Transformer part of the model.
    This is where most of the memory savings come from for Stable Diffusion.

SDXL: ~2500 MB memory savings.
SD 1.5: ~750 MB memory savings.
PixArt-XL-2: ~600 MB memory savings.

  • Text Encoder:
    Compresses the Text Encoder parts of the model.
    This is where most of the memory savings come from for PixArt.

PixArt-XL-2: ~4750 MB memory savings.
SDXL: ~750 MB memory savings.
SD 1.5: ~120 MB memory savings.

  • VAE:
    Compresses the VAE part of the model.
    Memory savings from compressing the VAE are fairly small.

SD 1.5 / SDXL / PixArt-XL-2: ~75 MB memory savings.
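Taken together, the three options correspond roughly to compressing individual pipeline components. A hedged sketch for SDXL, where the attribute names (unet, text_encoder, text_encoder_2, vae) follow diffusers conventions and the mapping to the UI options is an assumption:

```python
# Sketch: NNCF weight compression per pipeline part, mirroring the
# Model / Text Encoder / VAE options above; savings roughly match the
# SDXL numbers listed in this section.
import nncf
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

mode = nncf.CompressWeightsMode.INT8_SYM
pipe.unet = nncf.compress_weights(pipe.unet, mode=mode)                      # Model: ~2500 MB
pipe.text_encoder = nncf.compress_weights(pipe.text_encoder, mode=mode)      # Text Encoder
pipe.text_encoder_2 = nncf.compress_weights(pipe.text_encoder_2, mode=mode)  # Text Encoder
pipe.vae = nncf.compress_weights(pipe.vae, mode=mode)                        # VAE: ~75 MB
```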