# NNCF Model Compression
## Usage

- Go into `Quantization Settings`
- Enable the desired quantization options under the `NNCF` menu (Model, Transformer, TE and LLM are the main targets for most use cases)
- Reload the model
Note:
- VAE Upcast has to be set to false if you use the VAE option.
- If you get black images with SDXL models, use the FP16 Fixed VAE.
## Features
- Supports the `INT8`, `INT8_SYM`, `INT4` and `INT4_SYM` quantization schemes; `INT8_SYM` is very close to the original 16-bit quality
- Supports on-the-fly quantization during model load with DiT models (called pre mode)
- Supports quantization of the convolutional layers in UNet models
- Supports post-load quantization for any model
- Supports on-the-fly use of LoRa models
- Supports balanced offload
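To illustrate what the symmetric schemes above mean, here is a minimal sketch of `INT8_SYM`-style weight quantization: one scale per tensor, zero-point fixed at 0. This is a simplification for illustration only; NNCF's actual implementation uses per-channel scales and more elaborate calibration.

```python
import numpy as np

def quantize_int8_sym(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric INT8 quantization: single scale, zero-point fixed at 0."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original 16/32-bit weights."""
    return q.astype(np.float32) * scale

w = np.array([-2.0, -1.0, 0.0, 1.0, 2.0], dtype=np.float32)
q, scale = quantize_int8_sym(w)
w_hat = dequantize(q, scale)
# Storage drops from 2 bytes to 1 byte per weight;
# the rounding error per weight is at most scale / 2.
```

The symmetric variant keeps zero exactly representable, which is one reason `INT8_SYM` tracks the original 16-bit quality so closely.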
## Disadvantages
- It uses Autocast: the GPU will still run the model in 16-bit, so inference will be slower
- Fused projections are not compatible with NNCF
## Options

These results compare NNCF INT8 to 16-bit.
- Model:
  Compresses the UNet or Transformer part of the model.
  This is where most of the memory savings happen for Stable Diffusion.
  SDXL: ~2500 MB memory savings.
  SD 1.5: ~750 MB memory savings.
  PixArt-XL-2: ~600 MB memory savings.
- Text Encoder:
  Compresses the Text Encoder parts of the model.
  This is where most of the memory savings happen for PixArt.
  PixArt-XL-2: ~4750 MB memory savings.
  SDXL: ~750 MB memory savings.
  SD 1.5: ~120 MB memory savings.
- VAE:
  Compresses the VAE part of the model.
  Memory savings from compressing the VAE are fairly small.
  SD 1.5 / SDXL / PixArt-XL-2: ~75 MB memory savings.
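These figures follow from the weight storage itself: INT8 stores one byte per parameter instead of two. A back-of-the-envelope check, assuming the commonly cited figure of roughly 2.6 billion parameters for the SDXL UNet (an assumption, not from this document; scale metadata adds a little overhead, so real savings come out slightly lower):

```python
def int8_savings_mb(num_params: int, base_bits: int = 16) -> float:
    """Approximate memory saved by storing weights as INT8 instead of 16-bit."""
    saved_bits = num_params * (base_bits - 8)
    return saved_bits / 8 / 2**20  # bits -> bytes -> MiB

# ~2.6B parameters for the SDXL UNet (approximate, assumed figure)
estimate = int8_savings_mb(2_600_000_000)
# lands in the same ballpark as the ~2500 MB reported above
```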