## Stable Diffusion Pipeline

This is probably the best end-to-end semi-technical article:  
<https://stable-diffusion-art.com/how-stable-diffusion-work/>

And a detailed look at the diffusion process:
<https://towardsdatascience.com/understanding-diffusion-probabilistic-models-dpms-1940329d6048>

But this is a short look at the pipeline (code sketches follow the list):

1. Encoder / Conditioning
   Maps text (via tokenizer) or image (via vision model) to a semantic map  
   (e.g. CLIP text encoder)  
2. Sampler
   Generates the noise that is the starting point for mapping to content  
   (e.g. k_lms)  
3. Diffuser
   Creates vector content from the resolved noise plus the semantic map  
   (e.g. the actual Stable Diffusion checkpoint)  
4. Autoencoder
   Maps between latent and pixel space (actually creates images from vectors)  
   (e.g. a VAE, typically trained on large image datasets)  
5. Denoising
   Gets meaningful images from pixel signatures  
   Basically blends what the autoencoder inserted using information from the diffuser  
   (e.g. U-Net)
6. Loop and repeat
   From step #3, with cross-attention to blend results  
7. Run additional models as needed  
   - Upscale (e.g. ESRGAN)  
   - Restore faces (e.g. GFPGAN or CodeFormer)