# Model Compression with NNCF
## Usage
0. Use the Diffusers backend: set `Execution & Models` -> `Execution backend` to Diffusers
1. Go into `Compute Settings`
2. Enable the `Compress Model weights with NNCF` option
3. Restart the WebUI if this is your first time using NNCF; otherwise, just reload the model.
### Features
* Uses INT8, halving the model size; saves 3.4 GB of VRAM with SDXL (see the sketch after this list)
* Works in Diffusers backend
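To make the INT8 mode concrete, here is a minimal sketch using NNCF's data-free `nncf.compress_weights` API on a Diffusers pipeline. The model name and the choice of components are assumptions for illustration; SD.Next applies the compression internally when the option is enabled, so none of this is needed in normal use:

```python
# Minimal sketch: data-free INT8 weight compression of Diffusers pipeline
# components with NNCF. On PyTorch models, nncf.compress_weights targets
# Linear and Embedding layers: weights are stored in 8 bit and decompressed
# back to 16 bit on the fly at runtime, which is where the slowdown comes from.
import nncf
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # example model
    torch_dtype=torch.float16,
)

# Compress components individually; these map to the options listed below.
pipe.unet = nncf.compress_weights(pipe.unet)                  # "Model" option
pipe.text_encoder = nncf.compress_weights(pipe.text_encoder)  # "Text Encoder"
pipe.text_encoder_2 = nncf.compress_weights(pipe.text_encoder_2)
```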
### Disadvantages
* Uses autocast: the GPU still runs the model in 16 bit, so inference is slower
* Uses INT8, which can break ControlNet
* Using a LoRA triggers a model reload
* Not implemented in the Original backend
* Fused projections are not compatible with NNCF
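In practice this means Diffusers' optional QKV projection fusing has to be skipped while compression is active. A minimal sketch, continuing from the pipeline above (the `use_nncf` flag is hypothetical):

```python
# fuse_qkv_projections() replaces the attention q/k/v Linear layers with a
# single fused layer, which conflicts with NNCF-compressed weights.
use_nncf = True  # hypothetical flag: is NNCF compression enabled?
if not use_nncf:
    pipe.fuse_qkv_projections()
```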
## Options
The following results compare NNCF 8 bit to 16 bit.
- Model:
  Compresses the UNet or Transformer part of the model.
  This is where most of the memory savings happen for Stable Diffusion.
  - SDXL: ~2500 MB memory savings
  - SD 1.5: ~750 MB memory savings
  - PixArt-XL-2: ~600 MB memory savings
- Text Encoder:
  Compresses the Text Encoder parts of the model.
  This is where most of the memory savings happen for PixArt.
  - PixArt-XL-2: ~4750 MB memory savings
  - SDXL: ~750 MB memory savings
  - SD 1.5: ~120 MB memory savings
- VAE:
  Compresses the VAE part of the model.
  Memory savings from compressing the VAE are pretty small.
  - SD 1.5 / SDXL / PixArt-XL-2: ~75 MB memory savings
- 4 Bit Compression and Quantization:
  4-bit compression modes and quantization can be used with the OpenVINO backend, as sketched below.
  For more info: https://github.com/vladmandic/automatic/wiki/OpenVINO#quantization
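A minimal sketch of 4-bit weight compression with NNCF on an OpenVINO IR model; the file paths and the `mode`, `ratio`, and `group_size` values are illustrative assumptions, not SD.Next defaults:

```python
# Minimal sketch: 4-bit weight compression of an OpenVINO IR model with NNCF.
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("unet.xml")  # example path to an exported model

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,  # 4-bit symmetric weight quantization
    ratio=0.8,       # compress 80% of eligible layers to 4 bit; the rest stay 8 bit
    group_size=128,  # group-wise quantization granularity
)
ov.save_model(compressed, "unet_int4.xml")
```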