lodestones committed
Commit 9fce474 · verified · 1 Parent(s): d855967

Update README.md

Files changed (1): README.md (+46 −0)
README.md CHANGED
@@ -17,6 +17,18 @@ Based on `FLUX.1 [schnell]` with heavy architectural modifications.
  ![Alpha_Preview](./collage.png)
 
 
+ ## Table of Contents
+
+ - [How to run this model](#how-to-run-this-model)
+ - [ComfyUI](#comfyui)
+ - diffusers [WIP]
+ - Brief tech report
+ - [Architectural modifications](#architectural-modifications)
+ - [12B → 8.9B](#12b-%E2%86%92-89b)
+ - [MMDiT masking](#mmdit-masking)
+ - [Timestep Distributions](#timestep-distributions)
+ - [Minibatch Optimal Transport](#minibatch-optimal-transport)
+
  # How to run this model
 
 
@@ -48,3 +60,37 @@ git clone https://github.com/lodestone-rock/ComfyUI_FluxMod.git
  3. put `Chroma checkpoint` into `ComfyUI/models/diffusion_models` folder
  4. load chroma workflow to your ComfyUI
  5. Run the workflow
+
+ # Architectural Modifications
+ ## 12B → 8.9B
+ ### TL;DR: 3.3B of the parameters only encode a single input vector; I replaced them with 250M params.
+ Since FLUX is so big, I had to modify the architecture and ensure minimal knowledge was lost in the process. The most obvious thing to prune was the modulation layer. In the diagram it may look small, but in total FLUX allocates 3.3B parameters to it. Without getting too deep into the details, this layer's job is to tell the model which timestep it is at during the denoising process. It also receives information from pooled CLIP vectors.
+ [graph placeholder]
+ But after a simple experiment of zeroing those pooled vectors out, the model's output barely changed, which made pruning a breeze. Why? Because the only information left for this layer to encode is a single number in the range 0-1.
+ Yes, you read that right: 3.3B parameters were being used to encode 8 bytes of float values. That made it the most obvious layer to prune and replace with a simple FFN. The whole replacement process took only a day on my single 3090, and afterwards the model size was down to just 8.9B.
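+
+ To make the idea concrete, here is a minimal PyTorch sketch of that replacement: one small shared FFN that predicts per-block (scale, shift, gate) vectors from the timestep alone. The module name `SharedModulationFFN`, the FFN width, and the block/hidden sizes are illustrative, not Chroma's actual implementation, and the real FLUX blocks chunk their modulation differently.
+ ```python
+ import math
+ import torch
+ import torch.nn as nn
+
+ def timestep_embedding(t: torch.Tensor, dim: int = 256) -> torch.Tensor:
+     """Sinusoidal embedding of a scalar timestep in [0, 1]."""
+     half = dim // 2
+     freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
+     args = t[:, None].float() * freqs[None]
+     return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
+
+ class SharedModulationFFN(nn.Module):
+     """Small shared FFN standing in for the pruned per-block modulation stack.
+
+     Once the pooled CLIP input is dropped, the timestep is the only signal the
+     modulation needs, so one compact network can emit every block's vectors.
+     """
+     def __init__(self, n_blocks: int = 57, hidden_dim: int = 3072,
+                  t_dim: int = 256, width: int = 512):
+         super().__init__()
+         self.n_blocks, self.hidden_dim = n_blocks, hidden_dim
+         self.net = nn.Sequential(
+             nn.Linear(t_dim, width), nn.SiLU(),
+             nn.Linear(width, width), nn.SiLU(),
+             nn.Linear(width, n_blocks * 3 * hidden_dim),  # scale, shift, gate per block
+         )
+
+     def forward(self, t: torch.Tensor) -> torch.Tensor:
+         out = self.net(timestep_embedding(t))
+         return out.view(t.shape[0], self.n_blocks, 3, self.hidden_dim)
+
+ mod = SharedModulationFFN()
+ print(sum(p.numel() for p in mod.parameters()))  # a few hundred million, vs 3.3B before
+ print(mod(torch.rand(2)).shape)                  # torch.Size([2, 57, 3, 3072])
+ ```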
+
+ ## MMDiT Masking
+ ### TL;DR: Masking T5 padding tokens enhanced fidelity and increased stability during training.
+ It might not be obvious, but BFL had an oversight during pre-training: they did not mask the padding tokens in either T5 or MMDiT. So a short sentence like "a cat sat on a mat" actually looks like this to both T5 and MMDiT:
+ `<bos> a cat sat on a mat <pad><pad>...<pad><pad><pad>`
+
+ [graph placeholder]
+ The model ends up paying way too much attention to padding tokens, drowning out the actual prompt information. The fix? Masking, so the model doesn't associate anything with padding tokens.
+ But there's a catch: if you mask out all padding tokens, the model falls out of distribution and generates a blurry mess. The solution is to unmask just one padding token while masking the rest.
+ With this fix, MMDiT now only needs to pay attention to:
+ `<bos> a cat sat on a mat <pad>`
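+
+ As a sketch of what that looks like in code (the helper name `keep_one_pad` is made up, and how the mask gets consumed downstream is not Chroma's actual training code), assuming a standard tokenizer attention mask with 1 for real tokens and 0 for padding:
+ ```python
+ import torch
+
+ def keep_one_pad(attention_mask: torch.Tensor) -> torch.Tensor:
+     """Return a key mask that keeps all real tokens plus the first pad token.
+
+     Masking every pad token pushes the model out of distribution, so exactly
+     one padding position is left visible.
+     """
+     mask = attention_mask.bool().clone()
+     seq_len = mask.shape[-1]
+     # the first padding index equals the number of real tokens,
+     # clamped so a fully packed sequence doesn't index out of range
+     first_pad = mask.sum(dim=-1).clamp(max=seq_len - 1)
+     mask[torch.arange(mask.shape[0]), first_pad] = True
+     return mask
+
+ # "a cat sat on a mat" -> 7 real tokens, padded to length 10
+ attn = torch.tensor([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])
+ print(keep_one_pad(attn).int())  # tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])
+ ```
+ The idea is that the same mask gets applied both when running the T5 encoder and when the text tokens attend inside MMDiT, so the model stops associating anything with the masked padding positions.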
+
+ ## Timestep Distributions
+ ### TL;DR: A custom timestep distribution prevents loss spikes during training.
+ When training a diffusion/flow model, we sample random timesteps, but not uniformly. Why? Because empirically, training on certain timesteps more often makes the model converge faster.
+ FLUX uses a "lognorm" distribution, which concentrates training around the middle timesteps. But this approach has a flaw: the tails, where the high-noise and low-noise regions live, are trained very sparsely.
+ If you train for a long time (say, 1000 steps), the likelihood of hitting those tail regions is almost zero. The problem? When the model finally does see them, the loss spikes hard and throws training out of whack, even with a huge batch size.
+ The fix is simple: sample and train those tail timesteps a bit more frequently by using a `-x^2` shaped function instead. You can see in the image that this makes the distribution thicker near 0 and 1, ensuring better coverage.
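+
+ Here is a small illustration of the difference. The exact distribution Chroma trains with may differ; the quadratic below is just one U-shaped density that is thicker near 0 and 1:
+ ```python
+ import torch
+
+ def lognorm_timesteps(n: int) -> torch.Tensor:
+     """FLUX-style sampling: a Gaussian squashed through a sigmoid, middle-heavy."""
+     return torch.sigmoid(torch.randn(n))
+
+ def tail_heavy_timesteps(n: int) -> torch.Tensor:
+     """U-shaped quadratic density p(t) proportional to (t - 0.5)^2 on [0, 1].
+
+     Sampled by inverting the CDF F(t) = ((2t - 1)^3 + 1) / 2,
+     i.e. t = (cbrt(2u - 1) + 1) / 2 with u ~ Uniform(0, 1).
+     """
+     c = 2 * torch.rand(n) - 1
+     return (torch.sign(c) * c.abs().pow(1.0 / 3.0) + 1) / 2
+
+ # fraction of samples landing in the tails [0, 0.1) or (0.9, 1]
+ for name, fn in [("lognorm", lognorm_timesteps), ("tail-heavy", tail_heavy_timesteps)]:
+     t = fn(100_000)
+     frac = ((t < 0.1) | (t > 0.9)).float().mean().item()
+     print(f"{name}: {frac:.3f} of samples hit the tails")  # roughly 0.03 vs 0.49
+ ```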
+
+ ## Minibatch Optimal Transport
+ ### TL;DR: Transport problem math magic :P
+ This one's a bit math-heavy, but here's the gist: FLUX isn't actually "denoising" an image. What we're really doing is training a vector field that maps one distribution (noise) to another (images). Once the vector field is learned, we "flow" through it to transform noise into an image.
+ To keep it simple, just check out these two visuals:
+ [graph placeholder]
+
+ By choosing better noise-image pairings within each minibatch, this math magic reduces "path ambiguity" and accelerates training.
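+
+ For intuition, here is a minimal sketch of minibatch OT pairing (not Chroma's actual training loop; the helper name `ot_pair`, the tensor shapes, and the flow-matching convention are illustrative). It uses SciPy's exact assignment solver to re-pair noise and images inside a batch so the total squared distance between matched pairs is minimized:
+ ```python
+ import torch
+ from scipy.optimize import linear_sum_assignment
+
+ def ot_pair(noise: torch.Tensor, images: torch.Tensor):
+     """Re-pair noise and image samples within a minibatch.
+
+     Instead of keeping the random pairing, solve an assignment problem that
+     minimizes the total squared distance between matched pairs, so the vector
+     field has straighter, less ambiguous paths to learn.
+     """
+     b = noise.shape[0]
+     cost = torch.cdist(noise.reshape(b, -1), images.reshape(b, -1)) ** 2
+     row, col = linear_sum_assignment(cost.cpu().numpy())  # exact OT on the minibatch
+     return noise[torch.as_tensor(row)], images[torch.as_tensor(col)]
+
+ # usage inside a (hypothetical) flow-matching training step
+ noise = torch.randn(16, 4, 64, 64)    # latent-shaped Gaussian noise
+ images = torch.randn(16, 4, 64, 64)   # stand-in for encoded training images
+ noise, images = ot_pair(noise, images)
+ t = torch.rand(16).view(-1, 1, 1, 1)
+ x_t = (1 - t) * noise + t * images    # point on the straight noise-to-image path
+ target_velocity = images - noise      # what the vector field is trained to predict at x_t
+ ```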