## Stable Diffusion Pipeline
This is probably the best end-to-end semi-technical article:
<https://stable-diffusion-art.com/how-stable-diffusion-work/>
And a detailed look at the diffusion process:
<https://towardsdatascience.com/understanding-diffusion-probabilistic-models-dpms-1940329d6048>
In short, the pipeline looks like this:
1. Encoder / Conditioning
Converts text (via tokenizer) or an image (via vision model) into a semantic map
(e.g. CLIP text encoder)
2. Sampler
Starts from random noise in latent space and schedules how that noise is removed step by step
(e.g. k_lms)
3. Diffuser
Creates latent (vector) content from the partially denoised noise + semantic map
(e.g. the actual Stable Diffusion checkpoint)
4. Autoencoder
Maps between latent and pixel space (actually creates images from vectors)
(e.g. a VAE, a variational autoencoder trained on images)
5. Denoising
Predicts and removes noise so that a meaningful image emerges from the noisy latents
Runs in latent space, guided by the semantic map; pixels only appear once the autoencoder decodes the final latents
(e.g. U-Net)
6. Loop and repeat
Repeat from step 3 for each sampling step; cross-attention blends the semantic map into the result
7. Run additional models as needed
- Upscale (e.g. ESRGAN)
- Restore faces (e.g. GFPGAN or CodeFormer)
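
The data flow through steps 1 to 6 can be sketched as a toy Python example. This is not how any real Stable Diffusion implementation is written: every function name here is hypothetical, the "encoder", "U-Net" and "autoencoder" are trivial stand-ins working on short lists of numbers, and the only goal is to show the order in which the parts hand data to each other.

```python
import random

DIM = 8  # toy latent size (assumption; real latents are much larger tensors)

def encode_prompt(prompt):
    """Step 1 (toy): turn text into a fixed-length conditioning vector.
    Stand-in for a text encoder such as CLIP."""
    rng = random.Random(prompt)  # string seed makes this deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

def initial_latents(seed):
    """Step 2 (toy): random Gaussian noise as the starting latents."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]

def predict_noise(latents, cond):
    """Steps 3 and 5 (toy): stand-in for the U-Net noise prediction.
    Here, anything that differs from the conditioning counts as noise."""
    return [z - c for z, c in zip(latents, cond)]

def sampler_step(latents, noise, step_size):
    """Step 2 (toy): the sampler decides how much predicted noise to remove."""
    return [z - step_size * n for z, n in zip(latents, noise)]

def decode(latents):
    """Step 4 (toy): stand-in for the autoencoder decoder, latents to 0..255 pixels."""
    return [round(255 * (min(1.0, max(-1.0, z)) + 1.0) / 2.0) for z in latents]

def generate(prompt, seed=0, steps=20, step_size=0.5):
    cond = encode_prompt(prompt)
    latents = initial_latents(seed)
    for _ in range(steps):  # step 6: loop and repeat
        noise = predict_noise(latents, cond)
        latents = sampler_step(latents, noise, step_size)
    return decode(latents)  # pixels are produced once, at the end

pixels = generate("a cat", seed=1)
print(pixels)
```

Note that in this sketch the decoder runs only once, after the loop, and that the same prompt and seed always give the same output. Step 7 (upscaling, face restoration) would take `pixels` as input and is left out.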