Commit 3483284 by Alimoi · verified · 1 Parent(s): 733d9be

Upload 14 files

custom_nodes/ComfyUI-GGUF/LICENSE ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
custom_nodes/ComfyUI-GGUF/README.md ADDED
@@ -0,0 +1,49 @@
+ # ComfyUI-GGUF
+ GGUF Quantization support for native ComfyUI models
+
+ This is currently very much WIP. These custom nodes provide support for model files stored in the GGUF format popularized by [llama.cpp](https://github.com/ggerganov/llama.cpp).
+
+ While quantization wasn't feasible for regular UNET models (conv2d), transformer/DiT models such as flux seem less affected by quantization. This allows running them at much lower bits per weight, using variable-bitrate quants, on low-end GPUs. For further VRAM savings, a node to load a quantized version of the T5 text encoder is also included.
+
+ ![Comfy_Flux1_dev_Q4_0_GGUF_1024](https://github.com/user-attachments/assets/70d16d97-c522-4ef4-9435-633f128644c8)
+
+ Note: The "Force/Set CLIP Device" node is **NOT** part of this node pack. Do not install it if you only have one GPU. Do not set it to cuda:0 and then complain about OOM errors if you do not understand what it is for. There is no need to copy the workflow above; just use your own workflow and replace the stock "Load Diffusion Model" node with the "Unet Loader (GGUF)" node.
+
+ ## Installation
+
+ > [!IMPORTANT]
+ > Make sure your ComfyUI is on a recent-enough version to support custom ops when loading UNET-only models.
+
+ To install the custom node normally, git clone this repository into your custom nodes folder (`ComfyUI/custom_nodes`) and install the only dependency for inference (`pip install --upgrade gguf`).
+
+ ```
+ git clone https://github.com/city96/ComfyUI-GGUF
+ ```
+
+ To install the custom node on a standalone ComfyUI release, open a CMD inside the "ComfyUI_windows_portable" folder (where your `run_nvidia_gpu.bat` file is) and use the following commands:
+
+ ```
+ git clone https://github.com/city96/ComfyUI-GGUF ComfyUI/custom_nodes/ComfyUI-GGUF
+ .\python_embeded\python.exe -s -m pip install -r .\ComfyUI\custom_nodes\ComfyUI-GGUF\requirements.txt
+ ```
+
+ On macOS Sequoia, torch 2.4.1 seems to be required, as 2.6.X nightly versions cause a "M1 buffer is not large enough" error. See [this issue](https://github.com/city96/ComfyUI-GGUF/issues/107) for more information/workarounds.
+
+ ## Usage
+
+ Simply use the GGUF Unet loader found under the `bootleg` category. Place the .gguf model files in your `ComfyUI/models/unet` folder.
+
+ LoRA loading is experimental but it should work with just the built-in LoRA loader node(s).
+
+ Pre-quantized models:
+
+ - [flux1-dev GGUF](https://huggingface.co/city96/FLUX.1-dev-gguf)
+ - [flux1-schnell GGUF](https://huggingface.co/city96/FLUX.1-schnell-gguf)
+ - [stable-diffusion-3.5-large GGUF](https://huggingface.co/city96/stable-diffusion-3.5-large-gguf)
+ - [stable-diffusion-3.5-large-turbo GGUF](https://huggingface.co/city96/stable-diffusion-3.5-large-turbo-gguf)
+
+ Initial support for quantized T5 has also been added recently; these files can be loaded with the various `*CLIPLoader (gguf)` nodes, which can be used in place of the regular ones. For the CLIP model, use whatever model you were using before for CLIP. The loader can handle both types of files - `gguf` and regular `safetensors`/`bin`.
+
+ - [t5_v1.1-xxl GGUF](https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf)
+
+ See the instructions in the [tools](https://github.com/city96/ComfyUI-GGUF/tree/main/tools) folder for how to create your own quants.
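
The "bits per weight" figure the README refers to follows from the block layout of each quant type. The sketch below is illustrative only and is not part of the node pack: `QUANT_SIZES` is a hand-written subset of (weights per block, bytes per block) pairs that matches the layouts decoded in `dequant.py` (Q8_0: a 2-byte scale plus 32 int8 values; Q4_0: a 2-byte scale plus 16 bytes of packed nibbles), and the helper names and the 12-billion-parameter count are assumptions chosen for the example.

```python
# Hypothetical helpers for illustration; not part of ComfyUI-GGUF.
# (weights per block, bytes per block) for a few ggml tensor types.
QUANT_SIZES = {
    "F16": (1, 2),
    "Q8_0": (32, 34),  # float16 scale + 32 int8 values
    "Q4_0": (32, 18),  # float16 scale + 16 bytes of packed 4-bit values
}

def bits_per_weight(qtype: str) -> float:
    """Average storage cost of one weight, including the per-block scale."""
    block_size, type_size = QUANT_SIZES[qtype]
    return type_size * 8 / block_size

def approx_size_gib(n_params: int, qtype: str) -> float:
    """Rough size of the weights alone, ignoring metadata and unquantized tensors."""
    return n_params * bits_per_weight(qtype) / 8 / 2**30

if __name__ == "__main__":
    n_params = 12_000_000_000  # assumed parameter count, for illustration only
    for name in QUANT_SIZES:
        print(f"{name}: {bits_per_weight(name):.1f} bpw, "
              f"~{approx_size_gib(n_params, name):.1f} GiB")
```

Real files mix several tensor types, so the actual size of a given `.gguf` will differ from this single-type estimate.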
custom_nodes/ComfyUI-GGUF/__init__.py ADDED
@@ -0,0 +1,9 @@
+ # only import if running as a custom node
+ try:
+     import comfy.utils
+ except ImportError:
+     pass
+ else:
+     from .nodes import NODE_CLASS_MAPPINGS
+     NODE_DISPLAY_NAME_MAPPINGS = {k:v.TITLE for k,v in NODE_CLASS_MAPPINGS.items()}
+     __all__ = ['NODE_CLASS_MAPPINGS', 'NODE_DISPLAY_NAME_MAPPINGS']
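
The `try` / `except ImportError` / `else` shape above registers nodes only when the host package can be imported, so the same package can also be imported outside ComfyUI (for example by the conversion tools). A minimal standalone sketch of that pattern follows; `load_optional`, the made-up module name, and the `ExampleNode` class are hypothetical stand-ins, with the standard-library `json` module playing the role of `comfy.utils`.

```python
import importlib

def load_optional(module_name: str):
    """Return the imported module, or None if it is not installed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    else:
        return module

# "json" ships with Python; the second name is invented and should not resolve.
available = load_optional("json")
missing = load_optional("definitely_not_installed_module_xyz")

NODE_CLASS_MAPPINGS = {}
if available is not None:
    # Only populate the registry when the host dependency is importable.
    ExampleNode = type("ExampleNode", (), {"TITLE": "Example (demo)"})
    NODE_CLASS_MAPPINGS["Example"] = ExampleNode

# Same comprehension as in __init__.py: display names come from each class's TITLE.
NODE_DISPLAY_NAME_MAPPINGS = {k: v.TITLE for k, v in NODE_CLASS_MAPPINGS.items()}
```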
custom_nodes/ComfyUI-GGUF/dequant.py ADDED
@@ -0,0 +1,248 @@
+ # (c) City96 || Apache-2.0 (apache.org/licenses/LICENSE-2.0)
+ import gguf
+ import torch
+ from tqdm import tqdm
+
+
+ TORCH_COMPATIBLE_QTYPES = {None, gguf.GGMLQuantizationType.F32, gguf.GGMLQuantizationType.F16}
+
+ def is_torch_compatible(tensor):
+     return tensor is None or getattr(tensor, "tensor_type", None) in TORCH_COMPATIBLE_QTYPES
+
+ def is_quantized(tensor):
+     return not is_torch_compatible(tensor)
+
+ def dequantize_tensor(tensor, dtype=None, dequant_dtype=None):
+     qtype = getattr(tensor, "tensor_type", None)
+     oshape = getattr(tensor, "tensor_shape", tensor.shape)
+
+     if qtype in TORCH_COMPATIBLE_QTYPES:
+         return tensor.to(dtype)
+     elif qtype in dequantize_functions:
+         dequant_dtype = dtype if dequant_dtype == "target" else dequant_dtype
+         return dequantize(tensor.data, qtype, oshape, dtype=dequant_dtype).to(dtype)
+     else:
+         # this is incredibly slow
+         tqdm.write(f"Falling back to numpy dequant for qtype: {qtype}")
+         new = gguf.quants.dequantize(tensor.cpu().numpy(), qtype)
+         return torch.from_numpy(new).to(tensor.device, dtype=dtype)
+
+ def dequantize(data, qtype, oshape, dtype=None):
+     """
+     Dequantize tensor back to usable shape/dtype
+     """
+     block_size, type_size = gguf.GGML_QUANT_SIZES[qtype]
+     dequantize_blocks = dequantize_functions[qtype]
+
+     rows = data.reshape(
+         (-1, data.shape[-1])
+     ).view(torch.uint8)
+
+     n_blocks = rows.numel() // type_size
+     blocks = rows.reshape((n_blocks, type_size))
+     blocks = dequantize_blocks(blocks, block_size, type_size, dtype)
+     return blocks.reshape(oshape)
+
+ def to_uint32(x):
+     # no uint32 :(
+     x = x.view(torch.uint8).to(torch.int32)
+     return (x[:, 0] | x[:, 1] << 8 | x[:, 2] << 16 | x[:, 3] << 24).unsqueeze(1)
+
+ def split_block_dims(blocks, *args):
+     n_max = blocks.shape[1]
+     dims = list(args) + [n_max - sum(args)]
+     return torch.split(blocks, dims, dim=1)
+
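
The two helpers above do the byte-level bookkeeping for every block decoder: `split_block_dims` cuts a block into fixed-size fields plus "whatever is left", and `to_uint32` assembles four little-endian bytes into one integer. The sketch below mirrors them for a single block using plain `bytes` instead of tensors; it is an illustration under that simplification, and the 22-byte test block (2-byte scale, 4-byte high-bit mask, 16 bytes of nibbles, as in the Q5_0 layout) is made-up data.

```python
def split_block_dims(block: bytes, *sizes: int) -> list[bytes]:
    """Split one block into fields of the given byte sizes plus the remainder."""
    dims = list(sizes) + [len(block) - sum(sizes)]
    fields, pos = [], 0
    for n in dims:
        fields.append(block[pos:pos + n])
        pos += n
    return fields

def to_uint32(raw: bytes) -> int:
    """Combine four little-endian bytes into one unsigned integer."""
    return raw[0] | raw[1] << 8 | raw[2] << 16 | raw[3] << 24

# Made-up 22-byte block: scale bytes, high-bit mask bytes, then 16 payload bytes.
block = bytes([0x00, 0x3C]) + bytes([0x01, 0x02, 0x03, 0x04]) + bytes(range(16))
d, qh, qs = split_block_dims(block, 2, 4)
mask = to_uint32(qh)
```

Python integers are unbounded, so `mask` is always non-negative here. The tensor version works in `int32`, where a set top bit yields a negative number; the later shift-and-mask steps still extract the same individual bits.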
+ # Full weights #
+ def dequantize_blocks_BF16(blocks, block_size, type_size, dtype=None):
+     return (blocks.view(torch.int16).to(torch.int32) << 16).view(torch.float32)
+
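
The BF16 path relies on bfloat16 being the upper 16 bits of an IEEE float32, so widening is a 16-bit left shift followed by reinterpreting the bits. A minimal scalar sketch of the same idea with the standard `struct` module is shown here; the function names are hypothetical, and the truncating encoder exists only to build test inputs (real converters may round instead).

```python
import struct

def bf16_bits_to_float(bits: int) -> float:
    """Widen a 16-bit bfloat16 pattern to float32 by appending 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

def float_to_bf16_bits(value: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits (no rounding)."""
    return struct.unpack("<I", struct.pack("<f", value))[0] >> 16

one = bf16_bits_to_float(0x3F80)       # bit pattern of 1.0
minus_two = bf16_bits_to_float(0xC000) # bit pattern of -2.0
```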
+ # Legacy Quants #
+ def dequantize_blocks_Q8_0(blocks, block_size, type_size, dtype=None):
+     d, x = split_block_dims(blocks, 2)
+     d = d.view(torch.float16).to(dtype)
+     x = x.view(torch.int8)
+     return (d * x)
+
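
Q8_0 is the simplest block format in this file: the first two bytes are a float16 scale `d`, the remaining bytes are signed 8-bit values, and each weight is `d * x`. The following single-block sketch decodes that layout with `struct` rather than tensor views; `dequantize_q8_0_block` and the test block are illustrative assumptions, not code from the repository.

```python
import struct

QK8_0 = 32  # weights per Q8_0 block

def dequantize_q8_0_block(block: bytes) -> list[float]:
    """Decode one 34-byte block: a float16 scale followed by 32 signed bytes."""
    if len(block) != 2 + QK8_0:
        raise ValueError(f"expected {2 + QK8_0} bytes, got {len(block)}")
    (d,) = struct.unpack("<e", block[:2])        # float16 scale
    qs = struct.unpack(f"<{QK8_0}b", block[2:])  # int8 quantized values
    return [d * q for q in qs]

# Made-up block with scale 0.5 and quantized values -16..15.
block = struct.pack("<e", 0.5) + struct.pack("<32b", *range(-16, 16))
values = dequantize_q8_0_block(block)
```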
+ def dequantize_blocks_Q5_1(blocks, block_size, type_size, dtype=None):
+     n_blocks = blocks.shape[0]
+
+     d, m, qh, qs = split_block_dims(blocks, 2, 2, 4)
+     d = d.view(torch.float16).to(dtype)
+     m = m.view(torch.float16).to(dtype)
+     qh = to_uint32(qh)
+
+     qh = qh.reshape((n_blocks, 1)) >> torch.arange(32, device=d.device, dtype=torch.int32).reshape(1, 32)
+     ql = qs.reshape((n_blocks, -1, 1, block_size // 2)) >> torch.tensor([0, 4], device=d.device, dtype=torch.uint8).reshape(1, 1, 2, 1)
+     qh = (qh & 1).to(torch.uint8)
+     ql = (ql & 0x0F).reshape((n_blocks, -1))
+
+     qs = (ql | (qh << 4))
+     return (d * qs) + m
+
+ def dequantize_blocks_Q5_0(blocks, block_size, type_size, dtype=None):
+     n_blocks = blocks.shape[0]
+
+     d, qh, qs = split_block_dims(blocks, 2, 4)
+     d = d.view(torch.float16).to(dtype)
+     qh = to_uint32(qh)
+
+     qh = qh.reshape(n_blocks, 1) >> torch.arange(32, device=d.device, dtype=torch.int32).reshape(1, 32)
+     ql = qs.reshape(n_blocks, -1, 1, block_size // 2) >> torch.tensor([0, 4], device=d.device, dtype=torch.uint8).reshape(1, 1, 2, 1)
+
+     qh = (qh & 1).to(torch.uint8)
+     ql = (ql & 0x0F).reshape(n_blocks, -1)
+
+     qs = (ql | (qh << 4)).to(torch.int8) - 16
+     return (d * qs)
+
+ def dequantize_blocks_Q4_1(blocks, block_size, type_size, dtype=None):
+     n_blocks = blocks.shape[0]
+
+     d, m, qs = split_block_dims(blocks, 2, 2)
+     d = d.view(torch.float16).to(dtype)
+     m = m.view(torch.float16).to(dtype)
+
+     qs = qs.reshape((n_blocks, -1, 1, block_size // 2)) >> torch.tensor([0, 4], device=d.device, dtype=torch.uint8).reshape(1, 1, 2, 1)
+     qs = (qs & 0x0F).reshape(n_blocks, -1)
+
+     return (d * qs) + m
+
+ def dequantize_blocks_Q4_0(blocks, block_size, type_size, dtype=None):
+     n_blocks = blocks.shape[0]
+
+     d, qs = split_block_dims(blocks, 2)
+     d = d.view(torch.float16).to(dtype)
+
+     qs = qs.reshape((n_blocks, -1, 1, block_size // 2)) >> torch.tensor([0, 4], device=d.device, dtype=torch.uint8).reshape((1, 1, 2, 1))
+     qs = (qs & 0x0F).reshape((n_blocks, -1)).to(torch.int8) - 8
+     return (d * qs)
+
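
The shift-by-`[0, 4]` broadcast in the 4-bit decoders unpacks two weights from every byte: all low nibbles come first (weights 0 to 15), followed by all high nibbles (weights 16 to 31). For Q4_0 the nibble is then re-centred by subtracting 8 and multiplied by the scale. The sketch below spells this out for a single 18-byte block in plain Python; the function name and test data are assumptions made for illustration.

```python
import struct

def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Decode one 18-byte block: a float16 scale plus 16 bytes of packed nibbles."""
    if len(block) != 18:
        raise ValueError(f"expected 18 bytes, got {len(block)}")
    (d,) = struct.unpack("<e", block[:2])
    packed = block[2:]
    low = [b & 0x0F for b in packed]   # weights 0..15
    high = [b >> 4 for b in packed]    # weights 16..31
    return [d * (q - 8) for q in low + high]

# Made-up block: scale 2.0; byte i holds low nibble i and high nibble 15 - i.
packed = bytes(i | ((15 - i) << 4) for i in range(16))
block = struct.pack("<e", 2.0) + packed
values = dequantize_q4_0_block(block)
```

Q4_1 follows the same unpacking but, instead of subtracting a fixed 8, adds a per-block float16 minimum `m` after scaling.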
+ # K Quants #
+ QK_K = 256
+ K_SCALE_SIZE = 12
+
+ def get_scale_min(scales):
+     n_blocks = scales.shape[0]
+     scales = scales.view(torch.uint8)
+     scales = scales.reshape((n_blocks, 3, 4))
+
+     d, m, m_d = torch.split(scales, scales.shape[-2] // 3, dim=-2)
+
+     sc = torch.cat([d & 0x3F, (m_d & 0x0F) | ((d >> 2) & 0x30)], dim=-1)
+     min = torch.cat([m & 0x3F, (m_d >> 4) | ((m >> 2) & 0x30)], dim=-1)
+
+     return (sc.reshape((n_blocks, 8)), min.reshape((n_blocks, 8)))
+
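
`get_scale_min` unpacks the 12-byte scale field shared by Q4_K and Q5_K into eight 6-bit scales and eight 6-bit minimums. The first four of each sit in the low 6 bits of bytes 0-3 and 4-7; the last four are split between a nibble in bytes 8-11 and the two spare top bits of the earlier bytes. The single-block sketch below applies the same bit operations to plain `bytes`. The `pack_scale_min` inverse is a hypothetical helper written only to generate test data and is not taken from the repository.

```python
def get_scale_min(scales: bytes) -> tuple[list[int], list[int]]:
    """Unpack 12 bytes into eight 6-bit scales and eight 6-bit minimums."""
    if len(scales) != 12:
        raise ValueError(f"expected 12 bytes, got {len(scales)}")
    d, m, m_d = scales[0:4], scales[4:8], scales[8:12]
    sc = [x & 0x3F for x in d] + [(y & 0x0F) | ((x >> 2) & 0x30) for x, y in zip(d, m_d)]
    mn = [x & 0x3F for x in m] + [(y >> 4) | ((x >> 2) & 0x30) for x, y in zip(m, m_d)]
    return sc, mn

def pack_scale_min(sc: list[int], mn: list[int]) -> bytes:
    """Hypothetical inverse of get_scale_min, used here only to build test input."""
    d = [(sc[i] & 0x3F) | ((sc[i + 4] & 0x30) << 2) for i in range(4)]
    m = [(mn[i] & 0x3F) | ((mn[i + 4] & 0x30) << 2) for i in range(4)]
    m_d = [(sc[i + 4] & 0x0F) | ((mn[i + 4] & 0x0F) << 4) for i in range(4)]
    return bytes(d + m + m_d)

scales_in = [1, 2, 3, 4, 33, 47, 63, 16]
mins_in = [0, 63, 10, 20, 5, 48, 31, 62]
scales_out, mins_out = get_scale_min(pack_scale_min(scales_in, mins_in))
```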
+ def dequantize_blocks_Q6_K(blocks, block_size, type_size, dtype=None):
+     n_blocks = blocks.shape[0]
+
+     ql, qh, scales, d, = split_block_dims(blocks, QK_K // 2, QK_K // 4, QK_K // 16)
+
+     scales = scales.view(torch.int8).to(dtype)
+     d = d.view(torch.float16).to(dtype)
+     d = (d * scales).reshape((n_blocks, QK_K // 16, 1))
+
+     ql = ql.reshape((n_blocks, -1, 1, 64)) >> torch.tensor([0, 4], device=d.device, dtype=torch.uint8).reshape((1, 1, 2, 1))
+     ql = (ql & 0x0F).reshape((n_blocks, -1, 32))
+     qh = qh.reshape((n_blocks, -1, 1, 32)) >> torch.tensor([0, 2, 4, 6], device=d.device, dtype=torch.uint8).reshape((1, 1, 4, 1))
+     qh = (qh & 0x03).reshape((n_blocks, -1, 32))
+     q = (ql | (qh << 4)).to(torch.int8) - 32
+     q = q.reshape((n_blocks, QK_K // 16, -1))
+
+     return (d * q).reshape((n_blocks, QK_K))
+
+ def dequantize_blocks_Q5_K(blocks, block_size, type_size, dtype=None):
+     n_blocks = blocks.shape[0]
+
+     d, dmin, scales, qh, qs = split_block_dims(blocks, 2, 2, K_SCALE_SIZE, QK_K // 8)
+
+     d = d.view(torch.float16).to(dtype)
+     dmin = dmin.view(torch.float16).to(dtype)
+
+     sc, m = get_scale_min(scales)
+
+     d = (d * sc).reshape((n_blocks, -1, 1))
+     dm = (dmin * m).reshape((n_blocks, -1, 1))
+
+     ql = qs.reshape((n_blocks, -1, 1, 32)) >> torch.tensor([0, 4], device=d.device, dtype=torch.uint8).reshape((1, 1, 2, 1))
+     qh = qh.reshape((n_blocks, -1, 1, 32)) >> torch.tensor([i for i in range(8)], device=d.device, dtype=torch.uint8).reshape((1, 1, 8, 1))
+     ql = (ql & 0x0F).reshape((n_blocks, -1, 32))
+     qh = (qh & 0x01).reshape((n_blocks, -1, 32))
+     q = (ql | (qh << 4))
+
+     return (d * q - dm).reshape((n_blocks, QK_K))
+
+ def dequantize_blocks_Q4_K(blocks, block_size, type_size, dtype=None):
+     n_blocks = blocks.shape[0]
+
+     d, dmin, scales, qs = split_block_dims(blocks, 2, 2, K_SCALE_SIZE)
+     d = d.view(torch.float16).to(dtype)
+     dmin = dmin.view(torch.float16).to(dtype)
+
+     sc, m = get_scale_min(scales)
+
+     d = (d * sc).reshape((n_blocks, -1, 1))
+     dm = (dmin * m).reshape((n_blocks, -1, 1))
+
+     qs = qs.reshape((n_blocks, -1, 1, 32)) >> torch.tensor([0, 4], device=d.device, dtype=torch.uint8).reshape((1, 1, 2, 1))
+     qs = (qs & 0x0F).reshape((n_blocks, -1, 32))
+
+     return (d * qs - dm).reshape((n_blocks, QK_K))
+
+ def dequantize_blocks_Q3_K(blocks, block_size, type_size, dtype=None):
+     n_blocks = blocks.shape[0]
+
+     hmask, qs, scales, d = split_block_dims(blocks, QK_K // 8, QK_K // 4, 12)
+     d = d.view(torch.float16).to(dtype)
+
+     lscales, hscales = scales[:, :8], scales[:, 8:]
+     lscales = lscales.reshape((n_blocks, 1, 8)) >> torch.tensor([0, 4], device=d.device, dtype=torch.uint8).reshape((1, 2, 1))
+     lscales = lscales.reshape((n_blocks, 16))
+     hscales = hscales.reshape((n_blocks, 1, 4)) >> torch.tensor([0, 2, 4, 6], device=d.device, dtype=torch.uint8).reshape((1, 4, 1))
+     hscales = hscales.reshape((n_blocks, 16))
+     scales = (lscales & 0x0F) | ((hscales & 0x03) << 4)
+     scales = (scales.to(torch.int8) - 32)
+
+     dl = (d * scales).reshape((n_blocks, 16, 1))
+
+     ql = qs.reshape((n_blocks, -1, 1, 32)) >> torch.tensor([0, 2, 4, 6], device=d.device, dtype=torch.uint8).reshape((1, 1, 4, 1))
+     qh = hmask.reshape(n_blocks, -1, 1, 32) >> torch.tensor([i for i in range(8)], device=d.device, dtype=torch.uint8).reshape((1, 1, 8, 1))
+     ql = ql.reshape((n_blocks, 16, QK_K // 16)) & 3
+     qh = (qh.reshape((n_blocks, 16, QK_K // 16)) & 1) ^ 1
+     q = (ql.to(torch.int8) - (qh << 2).to(torch.int8))
+
+     return (dl * q).reshape((n_blocks, QK_K))
+
+ def dequantize_blocks_Q2_K(blocks, block_size, type_size, dtype=None):
+     n_blocks = blocks.shape[0]
+
+     scales, qs, d, dmin = split_block_dims(blocks, QK_K // 16, QK_K // 4, 2)
+     d = d.view(torch.float16).to(dtype)
+     dmin = dmin.view(torch.float16).to(dtype)
+
+     # (n_blocks, 16, 1)
+     dl = (d * (scales & 0xF)).reshape((n_blocks, QK_K // 16, 1))
+     ml = (dmin * (scales >> 4)).reshape((n_blocks, QK_K // 16, 1))
+
+     shift = torch.tensor([0, 2, 4, 6], device=d.device, dtype=torch.uint8).reshape((1, 1, 4, 1))
+
+     qs = (qs.reshape((n_blocks, -1, 1, 32)) >> shift) & 3
+     qs = qs.reshape((n_blocks, QK_K // 16, 16))
+     qs = dl * qs - ml
+
+     return qs.reshape((n_blocks, -1))
+
+ dequantize_functions = {
+     gguf.GGMLQuantizationType.BF16: dequantize_blocks_BF16,
+     gguf.GGMLQuantizationType.Q8_0: dequantize_blocks_Q8_0,
+     gguf.GGMLQuantizationType.Q5_1: dequantize_blocks_Q5_1,
+     gguf.GGMLQuantizationType.Q5_0: dequantize_blocks_Q5_0,
+     gguf.GGMLQuantizationType.Q4_1: dequantize_blocks_Q4_1,
+     gguf.GGMLQuantizationType.Q4_0: dequantize_blocks_Q4_0,
+     gguf.GGMLQuantizationType.Q6_K: dequantize_blocks_Q6_K,
+     gguf.GGMLQuantizationType.Q5_K: dequantize_blocks_Q5_K,
+     gguf.GGMLQuantizationType.Q4_K: dequantize_blocks_Q4_K,
+     gguf.GGMLQuantizationType.Q3_K: dequantize_blocks_Q3_K,
+     gguf.GGMLQuantizationType.Q2_K: dequantize_blocks_Q2_K,
+ }
custom_nodes/ComfyUI-GGUF/loader.py ADDED
@@ -0,0 +1,246 @@
+ # (c) City96 || Apache-2.0 (apache.org/licenses/LICENSE-2.0)
+ import torch
+ import gguf
+
+ from .ops import GGMLTensor
+ from .dequant import is_quantized, dequantize_tensor
+
+ IMG_ARCH_LIST = {"flux", "sd1", "sdxl", "sd3", "aura", "ltxv", "hyvid", "wan"}
+ TXT_ARCH_LIST = {"t5", "t5encoder", "llama"}
+
+ def get_orig_shape(reader, tensor_name):
+     field_key = f"comfy.gguf.orig_shape.{tensor_name}"
+     field = reader.get_field(field_key)
+     if field is None:
+         return None
+     # Has original shape metadata, so we try to decode it.
+     if len(field.types) != 2 or field.types[0] != gguf.GGUFValueType.ARRAY or field.types[1] != gguf.GGUFValueType.INT32:
+         raise TypeError(f"Bad original shape metadata for {field_key}: Expected ARRAY of INT32, got {field.types}")
+     return torch.Size(tuple(int(field.parts[part_idx][0]) for part_idx in field.data))
+
+ def get_field(reader, field_name, field_type):
+     field = reader.get_field(field_name)
+     if field is None:
+         return None
+     elif field_type == str:
+         # extra check here as this is used for checking arch string
+         if len(field.types) != 1 or field.types[0] != gguf.GGUFValueType.STRING:
+             raise TypeError(f"Bad type for GGUF {field_name} key: expected string, got {field.types!r}")
+         return str(field.parts[field.data[-1]], encoding="utf-8")
+     elif field_type in [int, float, bool]:
+         return field_type(field.parts[field.data[-1]])
+     else:
+         raise TypeError(f"Unknown field type {field_type}")
+
+ def get_list_field(reader, field_name, field_type):
+     field = reader.get_field(field_name)
+     if field is None:
+         return None
+     elif field_type == str:
+         return tuple(str(field.parts[part_idx], encoding="utf-8") for part_idx in field.data)
+     elif field_type in [int, float, bool]:
+         return tuple(field_type(field.parts[part_idx][0]) for part_idx in field.data)
+     else:
+         raise TypeError(f"Unknown field type {field_type}")
+
+ def gguf_sd_loader(path, handle_prefix="model.diffusion_model.", return_arch=False):
+     """
+     Read state dict as fake tensors
+     """
+     reader = gguf.GGUFReader(path)
+
+     # filter and strip prefix
+     has_prefix = False
+     if handle_prefix is not None:
+         prefix_len = len(handle_prefix)
+         tensor_names = set(tensor.name for tensor in reader.tensors)
+         has_prefix = any(s.startswith(handle_prefix) for s in tensor_names)
+
+     tensors = []
+     for tensor in reader.tensors:
+         sd_key = tensor_name = tensor.name
+         if has_prefix:
+             if not tensor_name.startswith(handle_prefix):
+                 continue
+             sd_key = tensor_name[prefix_len:]
+         tensors.append((sd_key, tensor))
+
+     # detect and verify architecture
+     compat = None
+     arch_str = get_field(reader, "general.architecture", str)
+     if arch_str is None: # stable-diffusion.cpp
+         # import here to avoid changes to convert.py breaking regular models
+         from .tools.convert import detect_arch
+         arch_str = detect_arch(set(val[0] for val in tensors)).arch
+         compat = "sd.cpp"
+     elif arch_str in ["pig"]:
+         from .tools.convert import detect_arch
+         arch_str = detect_arch(set(val[0] for val in tensors)).arch
+         compat = "pig"
+     elif arch_str not in IMG_ARCH_LIST and arch_str not in TXT_ARCH_LIST:
+         raise ValueError(f"Unexpected architecture type in GGUF file: {arch_str!r}")
+
+     if compat:
+         print(f"Warning: This model file is loaded in compatibility mode '{compat}' [arch:{arch_str}]")
+
+     # main loading loop
+     state_dict = {}
+     qtype_dict = {}
+     for sd_key, tensor in tensors:
+         tensor_name = tensor.name
+         torch_tensor = torch.from_numpy(tensor.data) # mmap
+
+         shape = get_orig_shape(reader, tensor_name)
+         if shape is None:
+             shape = torch.Size(tuple(int(v) for v in reversed(tensor.shape)))
+             # Workaround for stable-diffusion.cpp SDXL detection.
+             if compat == "sd.cpp" and arch_str == "sdxl":
+                 if any([tensor_name.endswith(x) for x in (".proj_in.weight", ".proj_out.weight")]):
+                     while len(shape) > 2 and shape[-1] == 1:
+                         shape = shape[:-1]
+
+         # add to state dict
+         if tensor.tensor_type in {gguf.GGMLQuantizationType.F32, gguf.GGMLQuantizationType.F16}:
+             torch_tensor = torch_tensor.view(*shape)
+         state_dict[sd_key] = GGMLTensor(torch_tensor, tensor_type=tensor.tensor_type, tensor_shape=shape)
+
+         # keep track of loaded tensor types
+         tensor_type_str = getattr(tensor.tensor_type, "name", repr(tensor.tensor_type))
+         qtype_dict[tensor_type_str] = qtype_dict.get(tensor_type_str, 0) + 1
+
+     # print loaded tensor type counts
+     print("gguf qtypes: " + ", ".join(f"{k} ({v})" for k, v in qtype_dict.items()))
+
+     # mark largest tensor for vram estimation
+     qsd = {k:v for k,v in state_dict.items() if is_quantized(v)}
+     if len(qsd) > 0:
+         max_key = max(qsd.keys(), key=lambda k: qsd[k].numel())
+         state_dict[max_key].is_largest_weight = True
+
+     if return_arch:
+         return (state_dict, arch_str)
+     return state_dict
+
124
+ # for remapping llama.cpp -> original key names
125
+ T5_SD_MAP = {
126
+ "enc.": "encoder.",
127
+ ".blk.": ".block.",
128
+ "token_embd": "shared",
129
+ "output_norm": "final_layer_norm",
130
+ "attn_q": "layer.0.SelfAttention.q",
131
+ "attn_k": "layer.0.SelfAttention.k",
132
+ "attn_v": "layer.0.SelfAttention.v",
133
+ "attn_o": "layer.0.SelfAttention.o",
134
+ "attn_norm": "layer.0.layer_norm",
135
+ "attn_rel_b": "layer.0.SelfAttention.relative_attention_bias",
136
+ "ffn_up": "layer.1.DenseReluDense.wi_1",
137
+ "ffn_down": "layer.1.DenseReluDense.wo",
138
+ "ffn_gate": "layer.1.DenseReluDense.wi_0",
139
+ "ffn_norm": "layer.1.layer_norm",
140
+ }
141
+
142
+ LLAMA_SD_MAP = {
143
+ "blk.": "model.layers.",
144
+ "attn_norm": "input_layernorm",
145
+ "attn_q": "self_attn.q_proj",
146
+ "attn_k": "self_attn.k_proj",
147
+ "attn_v": "self_attn.v_proj",
148
+ "attn_output": "self_attn.o_proj",
149
+ "ffn_up": "mlp.up_proj",
150
+ "ffn_down": "mlp.down_proj",
151
+ "ffn_gate": "mlp.gate_proj",
152
+ "ffn_norm": "post_attention_layernorm",
153
+ "token_embd": "model.embed_tokens",
154
+ "output_norm": "model.norm",
155
+ "output.weight": "lm_head.weight",
156
+ }
157
+
158
+ def sd_map_replace(raw_sd, key_map):
159
+ sd = {}
160
+ for k,v in raw_sd.items():
161
+ for s,d in key_map.items():
162
+ k = k.replace(s,d)
163
+ sd[k] = v
164
+ return sd
165
+
166
+ def llama_permute(raw_sd, n_head, n_head_kv):
167
+ # Reverse version of LlamaModel.permute in llama.cpp convert script
168
+ sd = {}
169
+ permute = lambda x,h: x.reshape(h, x.shape[0] // h // 2, 2, *x.shape[1:]).swapaxes(1, 2).reshape(x.shape)
170
+ for k,v in raw_sd.items():
171
+ if k.endswith(("q_proj.weight", "q_proj.bias")):
172
+ v.data = permute(v.data, n_head)
173
+ if k.endswith(("k_proj.weight", "k_proj.bias")):
174
+ v.data = permute(v.data, n_head_kv)
175
+ sd[k] = v
176
+ return sd
177
+
178
+ def gguf_tokenizer_loader(path, temb_shape):
179
+ # convert gguf tokenizer to spiece
180
+ print(f"Attempting to recreate sentencepiece tokenizer from GGUF file metadata...")
181
+ try:
182
+ from sentencepiece import sentencepiece_model_pb2 as model
183
+ except ImportError:
184
+ raise ImportError("Please make sure sentencepiece and protobuf are installed.\npip install sentencepiece protobuf")
185
+ spm = model.ModelProto()
186
+
187
+ reader = gguf.GGUFReader(path)
188
+
189
+ if get_field(reader, "tokenizer.ggml.model", str) == "t5":
190
+ if temb_shape == (256384, 4096): # probably UMT5
191
+ spm.trainer_spec.model_type = 1 # Unigram (do we have a T5 w/ BPE?)
192
+ else:
193
+ raise NotImplementedError(f"Unknown model, can't set tokenizer!")
194
+ else:
195
+ raise NotImplementedError(f"Unknown model, can't set tokenizer!")
196
+
197
+ spm.normalizer_spec.add_dummy_prefix = get_field(reader, "tokenizer.ggml.add_space_prefix", bool)
198
+ spm.normalizer_spec.remove_extra_whitespaces = get_field(reader, "tokenizer.ggml.remove_extra_whitespaces", bool)
199
+
200
+ tokens = get_list_field(reader, "tokenizer.ggml.tokens", str)
201
+ scores = get_list_field(reader, "tokenizer.ggml.scores", float)
202
+ toktypes = get_list_field(reader, "tokenizer.ggml.token_type", int)
203
+
204
+ for idx, (token, score, toktype) in enumerate(zip(tokens, scores, toktypes)):
205
+ # # These aren't present in the original?
206
+ # if toktype == 5 and idx >= temb_shape[0]%1000):
207
+ # continue
208
+
209
+ piece = spm.SentencePiece()
210
+ piece.piece = token
211
+ piece.score = score
212
+ piece.type = toktype
213
+ spm.pieces.append(piece)
214
+
215
+ # unsure if any of these are correct
216
+ spm.trainer_spec.byte_fallback = True
217
+ spm.trainer_spec.vocab_size = len(tokens) # split off unused?
218
+ spm.trainer_spec.max_sentence_length = 4096
219
+ spm.trainer_spec.eos_id = get_field(reader, "tokenizer.ggml.eos_token_id", int)
220
+ spm.trainer_spec.pad_id = get_field(reader, "tokenizer.ggml.padding_token_id", int)
221
+
222
+ print(f"Created tokenizer with vocab size of {len(spm.pieces)}")
223
+ del reader
224
+ return torch.ByteTensor(list(spm.SerializeToString()))
225
+
226
+ def gguf_clip_loader(path):
227
+ sd, arch = gguf_sd_loader(path, return_arch=True)
228
+ if arch in {"t5", "t5encoder"}:
229
+ temb_key = "token_embd.weight"
230
+ if temb_key in sd and sd[temb_key].shape == (256384, 4096):
231
+ # non-standard Comfy-Org tokenizer
232
+ sd["spiece_model"] = gguf_tokenizer_loader(path, sd[temb_key].shape)
233
+ # TODO: dequantizing token embed here is janky but otherwise we OOM due to tensor being massive.
234
+ print(f"Dequantizing {temb_key} to prevent runtime OOM.")
235
+ sd[temb_key] = dequantize_tensor(sd[temb_key], dtype=torch.float16)
236
+ sd = sd_map_replace(sd, T5_SD_MAP)
237
+ elif arch in {"llama"}:
238
+ temb_key = "token_embd.weight"
239
+ if temb_key in sd and sd[temb_key].shape != (128320, 4096):
240
+ # This still works. Raise error?
241
+ print("Warning! token_embd shape may be incorrect for llama 3 model!")
242
+ sd = sd_map_replace(sd, LLAMA_SD_MAP)
243
+ sd = llama_permute(sd, 32, 8) # L3
244
+ else:
245
+ pass
246
+ return sd
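The key remapping done by `sd_map_replace` can be illustrated with plain dictionaries. The sketch below reuses the function body from the diff and a four-entry subset of `T5_SD_MAP`; the string values stand in for tensors and the name `T5_SD_MAP_SUBSET` is made up for this example. Note that every map entry is applied to every key as a plain substring replacement, in dictionary order.

```python
# Subset of T5_SD_MAP from loader.py (llama.cpp name -> original name).
T5_SD_MAP_SUBSET = {
    "enc.": "encoder.",
    ".blk.": ".block.",
    "attn_q": "layer.0.SelfAttention.q",
    "ffn_up": "layer.1.DenseReluDense.wi_1",
}

def sd_map_replace(raw_sd, key_map):
    # Apply every substring replacement, in dict order, to every key.
    sd = {}
    for k, v in raw_sd.items():
        for s, d in key_map.items():
            k = k.replace(s, d)
        sd[k] = v
    return sd

# Placeholder strings stand in for the real GGMLTensor values.
raw = {
    "enc.blk.0.attn_q.weight": "tensor-a",
    "enc.blk.3.ffn_up.weight": "tensor-b",
}
remapped = sd_map_replace(raw, T5_SD_MAP_SUBSET)
for key in remapped:
    print(key)
```

Because the replacements are substring based, a map entry whose source string is a prefix of another tensor name component would also rewrite that component, so the order and specificity of the entries matter.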
custom_nodes/ComfyUI-GGUF/nodes.py ADDED
@@ -0,0 +1,297 @@
1
+ # (c) City96 || Apache-2.0 (apache.org/licenses/LICENSE-2.0)
2
+ import torch
3
+ import logging
4
+ import collections
5
+
6
+ import comfy.sd
7
+ import comfy.utils
8
+ import comfy.model_patcher
9
+ import comfy.model_management
10
+ import folder_paths
11
+
12
+ from .ops import GGMLOps, move_patch_to_device
13
+ from .loader import gguf_sd_loader, gguf_clip_loader
14
+ from .dequant import is_quantized, is_torch_compatible
15
+
16
+ def update_folder_names_and_paths(key, targets=[]):
17
+ # check for existing key
18
+ base = folder_paths.folder_names_and_paths.get(key, ([], {}))
19
+ base = base[0] if isinstance(base[0], (list, set, tuple)) else []
20
+ # find base key & add w/ fallback, sanity check + warning
21
+ target = next((x for x in targets if x in folder_paths.folder_names_and_paths), targets[0])
22
+ orig, _ = folder_paths.folder_names_and_paths.get(target, ([], {}))
23
+ folder_paths.folder_names_and_paths[key] = (orig or base, {".gguf"})
24
+ if base and base != orig:
25
+ logging.warning(f"Unknown file list already present on key {key}: {base}")
26
+
27
+ # Add custom keys for files ending in .gguf
28
+ update_folder_names_and_paths("unet_gguf", ["diffusion_models", "unet"])
29
+ update_folder_names_and_paths("clip_gguf", ["text_encoders", "clip"])
30
+
31
+ class GGUFModelPatcher(comfy.model_patcher.ModelPatcher):
32
+ patch_on_device = False
33
+
34
+ def patch_weight_to_device(self, key, device_to=None, inplace_update=False):
35
+ if key not in self.patches:
36
+ return
37
+ weight = comfy.utils.get_attr(self.model, key)
38
+
39
+ try:
40
+ from comfy.lora import calculate_weight
41
+ except Exception:
42
+ calculate_weight = self.calculate_weight
43
+
44
+ patches = self.patches[key]
45
+ if is_quantized(weight):
46
+ out_weight = weight.to(device_to)
47
+ patches = move_patch_to_device(patches, self.load_device if self.patch_on_device else self.offload_device)
48
+ # TODO: do we ever have legitimate duplicate patches? (i.e. patch on top of patched weight)
49
+ out_weight.patches = [(calculate_weight, patches, key)]
50
+ else:
51
+ inplace_update = self.weight_inplace_update or inplace_update
52
+ if key not in self.backup:
53
+ self.backup[key] = collections.namedtuple('Dimension', ['weight', 'inplace_update'])(
54
+ weight.to(device=self.offload_device, copy=inplace_update), inplace_update
55
+ )
56
+
57
+ if device_to is not None:
58
+ temp_weight = comfy.model_management.cast_to_device(weight, device_to, torch.float32, copy=True)
59
+ else:
60
+ temp_weight = weight.to(torch.float32, copy=True)
61
+
62
+ out_weight = calculate_weight(patches, temp_weight, key)
63
+ out_weight = comfy.float.stochastic_rounding(out_weight, weight.dtype)
64
+
65
+ if inplace_update:
66
+ comfy.utils.copy_to_param(self.model, key, out_weight)
67
+ else:
68
+ comfy.utils.set_attr_param(self.model, key, out_weight)
69
+
70
+ def unpatch_model(self, device_to=None, unpatch_weights=True):
71
+ if unpatch_weights:
72
+ for p in self.model.parameters():
73
+ if is_torch_compatible(p):
74
+ continue
75
+ patches = getattr(p, "patches", [])
76
+ if len(patches) > 0:
77
+ p.patches = []
78
+ # TODO: Find another way to not unload after patches
79
+ return super().unpatch_model(device_to=device_to, unpatch_weights=unpatch_weights)
80
+
81
+ mmap_released = False
82
+ def load(self, *args, force_patch_weights=False, **kwargs):
83
+ # always call `patch_weight_to_device` even for lowvram
84
+ super().load(*args, force_patch_weights=True, **kwargs)
85
+
86
+ # make sure nothing stays linked to mmap after first load
87
+ if not self.mmap_released:
88
+ linked = []
89
+ if kwargs.get("lowvram_model_memory", 0) > 0:
90
+ for n, m in self.model.named_modules():
91
+ if hasattr(m, "weight"):
92
+ device = getattr(m.weight, "device", None)
93
+ if device == self.offload_device:
94
+ linked.append((n, m))
95
+ continue
96
+ if hasattr(m, "bias"):
97
+ device = getattr(m.bias, "device", None)
98
+ if device == self.offload_device:
99
+ linked.append((n, m))
100
+ continue
101
+ if linked:
102
+ print(f"Attempting to release mmap ({len(linked)})")
103
+ for n, m in linked:
104
+ # TODO: possible to OOM, find better way to detach
105
+ m.to(self.load_device).to(self.offload_device)
106
+ self.mmap_released = True
107
+
108
+ def clone(self, *args, **kwargs):
109
+ src_cls = self.__class__
110
+ self.__class__ = GGUFModelPatcher
111
+ n = super().clone(*args, **kwargs)
112
+ n.__class__ = GGUFModelPatcher
113
+ self.__class__ = src_cls
114
+ # GGUF specific clone values below
115
+ n.patch_on_device = getattr(self, "patch_on_device", False)
116
+ return n
117
+
118
+ class UnetLoaderGGUF:
119
+ @classmethod
120
+ def INPUT_TYPES(s):
121
+ unet_names = [x for x in folder_paths.get_filename_list("unet_gguf")]
122
+ return {
123
+ "required": {
124
+ "unet_name": (unet_names,),
125
+ }
126
+ }
127
+
128
+ RETURN_TYPES = ("MODEL",)
129
+ FUNCTION = "load_unet"
130
+ CATEGORY = "bootleg"
131
+ TITLE = "Unet Loader (GGUF)"
132
+
133
+ def load_unet(self, unet_name, dequant_dtype=None, patch_dtype=None, patch_on_device=None):
134
+ ops = GGMLOps()
135
+
136
+ if dequant_dtype in ("default", None):
137
+ ops.Linear.dequant_dtype = None
138
+ elif dequant_dtype in ["target"]:
139
+ ops.Linear.dequant_dtype = dequant_dtype
140
+ else:
141
+ ops.Linear.dequant_dtype = getattr(torch, dequant_dtype)
142
+
143
+ if patch_dtype in ("default", None):
144
+ ops.Linear.patch_dtype = None
145
+ elif patch_dtype in ["target"]:
146
+ ops.Linear.patch_dtype = patch_dtype
147
+ else:
148
+ ops.Linear.patch_dtype = getattr(torch, patch_dtype)
149
+
150
+ # init model
151
+ unet_path = folder_paths.get_full_path("unet", unet_name)
152
+ sd = gguf_sd_loader(unet_path)
153
+ model = comfy.sd.load_diffusion_model_state_dict(
154
+ sd, model_options={"custom_operations": ops}
155
+ )
156
+ if model is None:
157
+ logging.error("ERROR UNSUPPORTED UNET {}".format(unet_path))
158
+ raise RuntimeError("ERROR: Could not detect model type of: {}".format(unet_path))
159
+ model = GGUFModelPatcher.clone(model)
160
+ model.patch_on_device = patch_on_device
161
+ return (model,)
162
+
163
+ class UnetLoaderGGUFAdvanced(UnetLoaderGGUF):
164
+ @classmethod
165
+ def INPUT_TYPES(s):
166
+ unet_names = [x for x in folder_paths.get_filename_list("unet_gguf")]
167
+ return {
168
+ "required": {
169
+ "unet_name": (unet_names,),
170
+ "dequant_dtype": (["default", "target", "float32", "float16", "bfloat16"], {"default": "default"}),
171
+ "patch_dtype": (["default", "target", "float32", "float16", "bfloat16"], {"default": "default"}),
172
+ "patch_on_device": ("BOOLEAN", {"default": False}),
173
+ }
174
+ }
175
+ TITLE = "Unet Loader (GGUF/Advanced)"
176
+
177
+ # Mapping from common name to name used in comfy.sd.CLIPType enum
178
+ CLIP_ENUM_MAP = {
179
+ "stable_diffusion": "STABLE_DIFFUSION",
180
+ "stable_cascade": "STABLE_CASCADE",
181
+ "stable_audio": "STABLE_AUDIO",
182
+ "sdxl": "STABLE_DIFFUSION",
183
+ "sd3": "SD3",
184
+ "flux": "FLUX",
185
+ "mochi": "MOCHI",
186
+ "ltxv": "LTXV",
187
+ "hunyuan_video": "HUNYUAN_VIDEO",
188
+ "pixart": "PIXART",
189
+ "wan": "WAN",
190
+ }
191
+
192
+ def get_clip_type(name):
193
+ enum_name = CLIP_ENUM_MAP.get(name, None)
194
+ if enum_name is None:
195
+ raise ValueError(f"Unknown CLIP model type {name}")
196
+ clip_type = getattr(comfy.sd.CLIPType, enum_name, None)
197
+ if clip_type is None:
198
+ raise ValueError(f"Unsupported CLIP model type {name} (Update ComfyUI)")
199
+ return clip_type
200
+
201
+ class CLIPLoaderGGUF:
202
+ @classmethod
203
+ def INPUT_TYPES(s):
204
+ return {
205
+ "required": {
206
+ "clip_name": (s.get_filename_list(),),
207
+ "type": (["stable_diffusion", "stable_cascade", "sd3", "stable_audio", "mochi", "ltxv", "pixart", "wan"],),
208
+ }
209
+ }
210
+
211
+ RETURN_TYPES = ("CLIP",)
212
+ FUNCTION = "load_clip"
213
+ CATEGORY = "bootleg"
214
+ TITLE = "CLIPLoader (GGUF)"
215
+
216
+ @classmethod
217
+ def get_filename_list(s):
218
+ files = []
219
+ files += folder_paths.get_filename_list("clip")
220
+ files += folder_paths.get_filename_list("clip_gguf")
221
+ return sorted(files)
222
+
223
+ def load_data(self, ckpt_paths):
224
+ clip_data = []
225
+ for p in ckpt_paths:
226
+ if p.endswith(".gguf"):
227
+ sd = gguf_clip_loader(p)
228
+ else:
229
+ sd = comfy.utils.load_torch_file(p, safe_load=True)
230
+ clip_data.append(sd)
231
+ return clip_data
232
+
233
+ def load_patcher(self, clip_paths, clip_type, clip_data):
234
+ clip = comfy.sd.load_text_encoder_state_dicts(
235
+ clip_type = clip_type,
236
+ state_dicts = clip_data,
237
+ model_options = {
238
+ "custom_operations": GGMLOps,
239
+ "initial_device": comfy.model_management.text_encoder_offload_device()
240
+ },
241
+ embedding_directory = folder_paths.get_folder_paths("embeddings"),
242
+ )
243
+ clip.patcher = GGUFModelPatcher.clone(clip.patcher)
244
+ return clip
245
+
246
+ def load_clip(self, clip_name, type="stable_diffusion"):
247
+ clip_path = folder_paths.get_full_path("clip", clip_name)
248
+ return (self.load_patcher([clip_path], get_clip_type(type), self.load_data([clip_path])),)
249
+
250
+ class DualCLIPLoaderGGUF(CLIPLoaderGGUF):
251
+ @classmethod
252
+ def INPUT_TYPES(s):
253
+ file_options = (s.get_filename_list(), )
254
+ return {
255
+ "required": {
256
+ "clip_name1": file_options,
257
+ "clip_name2": file_options,
258
+ "type": (("sdxl", "sd3", "flux", "hunyuan_video"),),
259
+ }
260
+ }
261
+
262
+ TITLE = "DualCLIPLoader (GGUF)"
263
+
264
+ def load_clip(self, clip_name1, clip_name2, type):
265
+ clip_path1 = folder_paths.get_full_path("clip", clip_name1)
266
+ clip_path2 = folder_paths.get_full_path("clip", clip_name2)
267
+ clip_paths = (clip_path1, clip_path2)
268
+ return (self.load_patcher(clip_paths, get_clip_type(type), self.load_data(clip_paths)),)
269
+
270
+ class TripleCLIPLoaderGGUF(CLIPLoaderGGUF):
271
+ @classmethod
272
+ def INPUT_TYPES(s):
273
+ file_options = (s.get_filename_list(), )
274
+ return {
275
+ "required": {
276
+ "clip_name1": file_options,
277
+ "clip_name2": file_options,
278
+ "clip_name3": file_options,
279
+ }
280
+ }
281
+
282
+ TITLE = "TripleCLIPLoader (GGUF)"
283
+
284
+ def load_clip(self, clip_name1, clip_name2, clip_name3, type="sd3"):
285
+ clip_path1 = folder_paths.get_full_path("clip", clip_name1)
286
+ clip_path2 = folder_paths.get_full_path("clip", clip_name2)
287
+ clip_path3 = folder_paths.get_full_path("clip", clip_name3)
288
+ clip_paths = (clip_path1, clip_path2, clip_path3)
289
+ return (self.load_patcher(clip_paths, get_clip_type(type), self.load_data(clip_paths)),)
290
+
291
+ NODE_CLASS_MAPPINGS = {
292
+ "UnetLoaderGGUF": UnetLoaderGGUF,
293
+ "CLIPLoaderGGUF": CLIPLoaderGGUF,
294
+ "DualCLIPLoaderGGUF": DualCLIPLoaderGGUF,
295
+ "TripleCLIPLoaderGGUF": TripleCLIPLoaderGGUF,
296
+ "UnetLoaderGGUFAdvanced": UnetLoaderGGUFAdvanced,
297
+ }
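The two-stage lookup in `get_clip_type` (common name to enum member name, then enum member name to enum value) can be tried without ComfyUI installed. In the sketch below, `FakeCLIPType` is a hypothetical stand-in for `comfy.sd.CLIPType`, the map is a small subset of `CLIP_ENUM_MAP`, and the `enum_cls` parameter is added only to make the example self-contained.

```python
from enum import Enum

class FakeCLIPType(Enum):
    # Hypothetical stand-in for comfy.sd.CLIPType on an older install
    # that does not define a WAN member yet.
    STABLE_DIFFUSION = 1
    SD3 = 2

# Subset of CLIP_ENUM_MAP from nodes.py.
CLIP_ENUM_MAP = {
    "stable_diffusion": "STABLE_DIFFUSION",
    "sdxl": "STABLE_DIFFUSION",
    "sd3": "SD3",
    "wan": "WAN",
}

def get_clip_type(name, enum_cls=FakeCLIPType):
    enum_name = CLIP_ENUM_MAP.get(name)
    if enum_name is None:
        # The node does not know this name at all.
        raise ValueError(f"Unknown CLIP model type {name}")
    clip_type = getattr(enum_cls, enum_name, None)
    if clip_type is None:
        # The node knows the name, but the installed enum lacks the member.
        raise ValueError(f"Unsupported CLIP model type {name} (Update ComfyUI)")
    return clip_type

print(get_clip_type("sdxl"))
```

Splitting the check this way lets the node distinguish a typo in the type name from a ComfyUI version that is simply too old for the requested model.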
custom_nodes/ComfyUI-GGUF/ops.py ADDED
@@ -0,0 +1,252 @@
1
+ # (c) City96 || Apache-2.0 (apache.org/licenses/LICENSE-2.0)
2
+ import gguf
3
+ import torch
4
+
5
+ import comfy.ops
6
+ import comfy.model_management
7
+ from .dequant import dequantize_tensor, is_quantized
8
+
9
+ # to avoid breaking really old pytorch versions
10
+ if hasattr(torch, "compiler") and hasattr(torch.compiler, "disable"):
11
+ torch_compiler_disable = torch.compiler.disable
12
+ else:
13
+ def torch_compiler_disable(*args, **kwargs):
14
+ def noop(x):
15
+ return x
16
+ return noop
17
+
18
+ class GGMLTensor(torch.Tensor):
19
+ """
20
+ Main tensor-like class for storing quantized weights
21
+ """
22
+ def __init__(self, *args, tensor_type, tensor_shape, patches=[], **kwargs):
23
+ super().__init__()
24
+ self.tensor_type = tensor_type
25
+ self.tensor_shape = tensor_shape
26
+ self.patches = patches
27
+
28
+ def __new__(cls, *args, tensor_type, tensor_shape, patches=[], **kwargs):
29
+ return super().__new__(cls, *args, **kwargs)
30
+
31
+ def to(self, *args, **kwargs):
32
+ new = super().to(*args, **kwargs)
33
+ new.tensor_type = getattr(self, "tensor_type", None)
34
+ new.tensor_shape = getattr(self, "tensor_shape", new.data.shape)
35
+ new.patches = getattr(self, "patches", []).copy()
36
+ return new
37
+
38
+ def clone(self, *args, **kwargs):
39
+ return self
40
+
41
+ def detach(self, *args, **kwargs):
42
+ return self
43
+
44
+ def copy_(self, *args, **kwargs):
45
+ # fixes .weight.copy_ in comfy/clip_model/CLIPTextModel
46
+ try:
47
+ return super().copy_(*args, **kwargs)
48
+ except Exception as e:
49
+ print(f"ignoring 'copy_' on tensor: {e}")
50
+
51
+ def new_empty(self, size, *args, **kwargs):
52
+ # Intel Arc fix, ref#50
53
+ new_tensor = super().new_empty(size, *args, **kwargs)
54
+ return GGMLTensor(
55
+ new_tensor,
56
+ tensor_type = getattr(self, "tensor_type", None),
57
+ tensor_shape = size,
58
+ patches = getattr(self, "patches", []).copy()
59
+ )
60
+
61
+ @property
62
+ def shape(self):
63
+ if not hasattr(self, "tensor_shape"):
64
+ self.tensor_shape = self.size()
65
+ return self.tensor_shape
66
+
67
+ class GGMLLayer(torch.nn.Module):
68
+ """
69
+ This (should) be responsible for de-quantizing on the fly
70
+ """
71
+ comfy_cast_weights = True
72
+ dequant_dtype = None
73
+ patch_dtype = None
74
+ largest_layer = False
75
+ torch_compatible_tensor_types = {None, gguf.GGMLQuantizationType.F32, gguf.GGMLQuantizationType.F16}
76
+
77
+ def is_ggml_quantized(self, *, weight=None, bias=None):
78
+ if weight is None:
79
+ weight = self.weight
80
+ if bias is None:
81
+ bias = self.bias
82
+ return is_quantized(weight) or is_quantized(bias)
83
+
84
+ def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
85
+ weight, bias = state_dict.get(f"{prefix}weight"), state_dict.get(f"{prefix}bias")
86
+ # NOTE: using modified load for linear due to not initializing on creation, see GGMLOps todo
87
+ if self.is_ggml_quantized(weight=weight, bias=bias) or isinstance(self, torch.nn.Linear):
88
+ return self.ggml_load_from_state_dict(state_dict, prefix, *args, **kwargs)
89
+ return super()._load_from_state_dict(state_dict, prefix, *args, **kwargs)
90
+
91
+ def ggml_load_from_state_dict(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs):
92
+ prefix_len = len(prefix)
93
+ for k,v in state_dict.items():
94
+ if k[prefix_len:] == "weight":
95
+ self.weight = torch.nn.Parameter(v, requires_grad=False)
96
+ elif k[prefix_len:] == "bias" and v is not None:
97
+ self.bias = torch.nn.Parameter(v, requires_grad=False)
98
+ else:
99
+ unexpected_keys.append(k)
100
+
101
+ # For Linear layer with missing weight
102
+ if self.weight is None and isinstance(self, torch.nn.Linear):
103
+ v = torch.zeros(self.out_features, self.in_features)
104
+ self.weight = torch.nn.Parameter(v, requires_grad=False)
105
+ missing_keys.append(prefix+"weight")
106
+
107
+ # for vram estimation (TODO: less fragile logic?)
108
+ if getattr(self.weight, "is_largest_weight", False):
109
+ self.largest_layer = True
110
+
111
+ def _save_to_state_dict(self, *args, **kwargs):
112
+ if self.is_ggml_quantized():
113
+ return self.ggml_save_to_state_dict(*args, **kwargs)
114
+ return super()._save_to_state_dict(*args, **kwargs)
115
+
116
+ def ggml_save_to_state_dict(self, destination, prefix, keep_vars):
117
+ # This is a fake state dict for vram estimation
118
+ weight = torch.zeros_like(self.weight, device=torch.device("meta"))
119
+ destination[prefix + "weight"] = weight
120
+ if self.bias is not None:
121
+ bias = torch.zeros_like(self.bias, device=torch.device("meta"))
122
+ destination[prefix + "bias"] = bias
123
+
124
+ # Take into account space required for dequantizing the largest tensor
125
+ if self.largest_layer:
126
+ shape = getattr(self.weight, "tensor_shape", self.weight.shape)
127
+ dtype = self.dequant_dtype or torch.float16
128
+ temp = torch.empty(*shape, device=torch.device("meta"), dtype=dtype)
129
+ destination[prefix + "temp.weight"] = temp
130
+
131
+ return
132
+ # This would return the dequantized state dict
133
+ destination[prefix + "weight"] = self.get_weight(self.weight)
134
+ if bias is not None:
135
+ destination[prefix + "bias"] = self.get_weight(self.bias)
136
+
137
+ def get_weight(self, tensor, dtype):
138
+ if tensor is None:
139
+ return
140
+
141
+ # consolidate and load patches to GPU in async
142
+ patch_list = []
143
+ device = tensor.device
144
+ for function, patches, key in getattr(tensor, "patches", []):
145
+ patch_list += move_patch_to_device(patches, device)
146
+
147
+ # dequantize tensor while patches load
148
+ weight = dequantize_tensor(tensor, dtype, self.dequant_dtype)
149
+
150
+ # prevent propagating custom tensor class
151
+ if isinstance(weight, GGMLTensor):
152
+ weight.__class__ = torch.Tensor
153
+
154
+ # apply patches
155
+ if patch_list:
156
+ if self.patch_dtype is None:
157
+ weight = function(patch_list, weight, key)
158
+ else:
159
+ # for testing, may degrade image quality
160
+ patch_dtype = dtype if self.patch_dtype == "target" else self.patch_dtype
161
+ weight = function(patch_list, weight, key, patch_dtype)
162
+ return weight
163
+
164
+ @torch_compiler_disable()
165
+ def cast_bias_weight(s, input=None, dtype=None, device=None, bias_dtype=None):
166
+ if input is not None:
167
+ if dtype is None:
168
+ dtype = getattr(input, "dtype", torch.float32)
169
+ if bias_dtype is None:
170
+ bias_dtype = dtype
171
+ if device is None:
172
+ device = input.device
173
+
174
+ bias = None
175
+ non_blocking = comfy.model_management.device_supports_non_blocking(device)
176
+ if s.bias is not None:
177
+ bias = s.get_weight(s.bias.to(device), dtype)
178
+ bias = comfy.ops.cast_to(bias, bias_dtype, device, non_blocking=non_blocking, copy=False)
179
+
180
+ weight = s.get_weight(s.weight.to(device), dtype)
181
+ weight = comfy.ops.cast_to(weight, dtype, device, non_blocking=non_blocking, copy=False)
182
+ return weight, bias
183
+
184
+ def forward_comfy_cast_weights(self, input, *args, **kwargs):
185
+ if self.is_ggml_quantized():
186
+ out = self.forward_ggml_cast_weights(input, *args, **kwargs)
187
+ else:
188
+ out = super().forward_comfy_cast_weights(input, *args, **kwargs)
189
+
190
+ # non-ggml forward might still propagate custom tensor class
191
+ if isinstance(out, GGMLTensor):
192
+ out.__class__ = torch.Tensor
193
+ return out
194
+
195
+ def forward_ggml_cast_weights(self, input):
196
+ raise NotImplementedError
197
+
198
+ class GGMLOps(comfy.ops.manual_cast):
199
+ """
200
+ Dequantize weights on the fly before doing the compute
201
+ """
202
+ class Linear(GGMLLayer, comfy.ops.manual_cast.Linear):
203
+ def __init__(self, in_features, out_features, bias=True, device=None, dtype=None):
204
+ torch.nn.Module.__init__(self)
205
+ # TODO: better workaround for reserved memory spike on windows
206
+ # Issue is with `torch.empty` still reserving the full memory for the layer
207
+ # Windows doesn't over-commit memory so without this 24GB+ of pagefile is used
208
+ self.in_features = in_features
209
+ self.out_features = out_features
210
+ self.weight = None
211
+ self.bias = None
212
+
213
+ def forward_ggml_cast_weights(self, input):
214
+ weight, bias = self.cast_bias_weight(input)
215
+ return torch.nn.functional.linear(input, weight, bias)
216
+
217
+ class Conv2d(GGMLLayer, comfy.ops.manual_cast.Conv2d):
218
+ def forward_ggml_cast_weights(self, input):
219
+ weight, bias = self.cast_bias_weight(input)
220
+ return self._conv_forward(input, weight, bias)
221
+
222
+ class Embedding(GGMLLayer, comfy.ops.manual_cast.Embedding):
223
+ def forward_ggml_cast_weights(self, input, out_dtype=None):
224
+ output_dtype = out_dtype
225
+ if self.weight.dtype == torch.float16 or self.weight.dtype == torch.bfloat16:
226
+ out_dtype = None
227
+ weight, _bias = self.cast_bias_weight(self, device=input.device, dtype=out_dtype)
228
+ return torch.nn.functional.embedding(
229
+ input, weight, self.padding_idx, self.max_norm, self.norm_type, self.scale_grad_by_freq, self.sparse
230
+ ).to(dtype=output_dtype)
231
+
232
+ class LayerNorm(GGMLLayer, comfy.ops.manual_cast.LayerNorm):
233
+ def forward_ggml_cast_weights(self, input):
234
+ if self.weight is None:
235
+ return super().forward_comfy_cast_weights(input)
236
+ weight, bias = self.cast_bias_weight(input)
237
+ return torch.nn.functional.layer_norm(input, self.normalized_shape, weight, bias, self.eps)
238
+
239
+ class GroupNorm(GGMLLayer, comfy.ops.manual_cast.GroupNorm):
240
+ def forward_ggml_cast_weights(self, input):
241
+ weight, bias = self.cast_bias_weight(input)
242
+ return torch.nn.functional.group_norm(input, self.num_groups, weight, bias, self.eps)
243
+
244
+ def move_patch_to_device(item, device):
245
+ if isinstance(item, torch.Tensor):
246
+ return item.to(device, non_blocking=True)
247
+ elif isinstance(item, tuple):
248
+ return tuple(move_patch_to_device(x, device) for x in item)
249
+ elif isinstance(item, list):
250
+ return [move_patch_to_device(x, device) for x in item]
251
+ else:
252
+ return item
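The recursion in `move_patch_to_device` can be exercised without PyTorch. The sketch below keeps the same structure but swaps `torch.Tensor` for a hypothetical `FakeTensor` class that only records its device; the `tensor_cls` parameter and the shape of the example patch are assumptions made for illustration.

```python
class FakeTensor:
    # Hypothetical stand-in for torch.Tensor that only tracks its device.
    def __init__(self, device="cpu"):
        self.device = device

    def to(self, device, non_blocking=False):
        # Like torch, return a new object instead of mutating in place.
        return FakeTensor(device)

def move_patch_to_device(item, device, tensor_cls=FakeTensor):
    # Walk nested tuples/lists, move tensors, pass everything else through.
    if isinstance(item, tensor_cls):
        return item.to(device, non_blocking=True)
    elif isinstance(item, tuple):
        return tuple(move_patch_to_device(x, device, tensor_cls) for x in item)
    elif isinstance(item, list):
        return [move_patch_to_device(x, device, tensor_cls) for x in item]
    else:
        return item

# A made-up patch entry: strength, a pair of tensors, and non-tensor metadata.
patch = [(1.0, (FakeTensor(), FakeTensor()), None, "lora_key")]
moved = move_patch_to_device(patch, "cuda:0")
print(moved[0][1][0].device)
```

The container types are preserved (tuples stay tuples, lists stay lists) and the input structure is left untouched, which is what allows the patcher to keep the original patches on the offload device.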
custom_nodes/ComfyUI-GGUF/requirements.txt ADDED
@@ -0,0 +1,5 @@
1
+ # main
2
+ gguf>=0.13.0
3
+ # optional - tokenizer
4
+ sentencepiece
5
+ protobuf
custom_nodes/ComfyUI-GGUF/tools/README.md ADDED
@@ -0,0 +1,69 @@
1
+ ## STEP 1 (Patch files with Unix (LF) line endings)
2
+
3
+ Convert the line endings of the patch files from Windows (CRLF) to Unix (LF):
4
+
5
+ ```
6
+ python fix_lines_ending.py
7
+ ```
8
+ ## STEP 2 (Clone llama.cpp version of gguf-py)
9
+
10
+ Git clone llama.cpp into the current folder. You may also install gguf-py from the llama.cpp repo directly, though the one specified in `requirements.txt` should also work on recent versions.
11
+
12
+ ```
13
+ git clone https://github.com/ggerganov/llama.cpp
14
+ pip install llama.cpp/gguf-py
15
+ ```
16
+
17
+ ## STEP 3 (Convert to FP16 or BF16)
18
+
19
+ To convert your initial source model to FP16 (or BF16), run the following command:
20
+ ```
21
+ python convert.py --src E:\models\unet\flux1-dev.safetensors
22
+ ```
23
+ ## STEP 4 (Patch llama.cpp)
24
+
25
+ - To quantize the model, first apply the provided patch to the llama.cpp repo you've just cloned.
26
+ ```
27
+ cd llama.cpp
28
+ git checkout tags/b3600
29
+ git apply ..\lcpp.patch
30
+ ```
31
+
32
+ - To quantize **SD3** or **AuraFlow** models, you should use the patch `lcpp_sd3.patch` and target `tags/b3962` instead.
33
+ - There is a [WIP PR for other model architectures](https://github.com/city96/ComfyUI-GGUF/pull/216)
34
+ ```
35
+ cd llama.cpp
36
+ git checkout tags/b3962
37
+ git apply ..\lcpp_sd3.patch
38
+ ```
39
+
40
+
41
+ ## STEP 5 (Compile llama-quantize binary)
42
+
43
+ Then, compile the llama-quantize binary. This example uses cmake; on Linux you can just use make.
44
+ ```
45
+ mkdir build
46
+ cd build
47
+ cmake ..
48
+ cmake --build . --config Debug -j10 --target llama-quantize
49
+ cd ..
50
+ cd ..
51
+ ```
52
+
53
+ ## STEP 6 (Quantization)
54
+ Now you can use the newly built binary to quantize your model to the desired format:
55
+ ```
56
+ llama.cpp\build\bin\Debug\llama-quantize.exe E:\models\unet\flux1-dev-BF16.gguf E:\models\unet\flux1-dev-Q4_K_S.gguf Q4_K_S
57
+ ```
58
+
59
+
60
+ You can extract the patch again with `git diff src\llama.cpp > lcpp.patch` if you wish to change something and contribute back.
61
+
62
+
63
+ > [!WARNING]
64
+ > Do not use the diffusers UNET for flux; it won't work. Use the default/reference checkpoint format instead, because the loader expects q/k/v merged into one qkv key. You can convert a diffusers checkpoint by loading it in ComfyUI and saving it using the built-in "ModelSave" node.
65
+
66
+
67
+ > [!WARNING]
68
+ > Do not quantize SDXL / SD1 / other Conv2D heavy models. There's little to no benefit with these models. If you do, make sure to **extract the UNET model first**.
69
+ >This should be obvious, but also don't use the resulting llama-quantize binary with LLMs.
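Step 1 of the README above runs `fix_lines_ending.py`, which is not shown in this part of the diff. The following is only a minimal sketch of what a CRLF to LF conversion could look like; the function names `crlf_to_lf` and `fix_file` are hypothetical and are not taken from the repository's script.

```python
def crlf_to_lf(data: bytes) -> bytes:
    # Work on bytes so no text decoding can alter the patch contents.
    return data.replace(b"\r\n", b"\n")

def fix_file(path: str) -> bool:
    # Rewrite the file only if something changed; return True when it did.
    with open(path, "rb") as f:
        original = f.read()
    fixed = crlf_to_lf(original)
    if fixed == original:
        return False
    with open(path, "wb") as f:
        f.write(fixed)
    return True

sample = b"diff --git a/x b/x\r\n+line\n-line\r\n"
print(crlf_to_lf(sample))
```

Under this assumption, calling `fix_file("lcpp.patch")` before `git apply` would normalize a patch file that was checked out with Windows line endings.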
custom_nodes/ComfyUI-GGUF/tools/convert.py ADDED
@@ -0,0 +1,248 @@
+ # (c) City96 || Apache-2.0 (apache.org/licenses/LICENSE-2.0)
+ import os
+ import torch
+ import gguf # This needs to be the llama.cpp one specifically!
+ import argparse
+ from tqdm import tqdm
+
+ from safetensors.torch import load_file
+
+ QUANTIZATION_THRESHOLD = 1024
+ REARRANGE_THRESHOLD = 512
+ MAX_TENSOR_NAME_LENGTH = 127
+
+ class ModelTemplate:
+     arch = "invalid" # string describing architecture
+     shape_fix = False # whether to reshape tensors
+     keys_detect = [] # list of lists to match in state dict
+     keys_banned = [] # list of keys that should mark model as invalid for conversion
+
+ class ModelFlux(ModelTemplate):
+     arch = "flux"
+     keys_detect = [
+         ("transformer_blocks.0.attn.norm_added_k.weight",),
+         ("double_blocks.0.img_attn.proj.weight",),
+     ]
+     keys_banned = ["transformer_blocks.0.attn.norm_added_k.weight",]
+
+ class ModelSD3(ModelTemplate):
+     arch = "sd3"
+     keys_detect = [
+         ("transformer_blocks.0.attn.add_q_proj.weight",),
+         ("joint_blocks.0.x_block.attn.qkv.weight",),
+     ]
+     keys_banned = ["transformer_blocks.0.attn.add_q_proj.weight",]
+
+ class ModelAura(ModelTemplate):
+     arch = "aura"
+     keys_detect = [
+         ("double_layers.3.modX.1.weight",),
+         ("joint_transformer_blocks.3.ff_context.out_projection.weight",),
+     ]
+     keys_banned = ["joint_transformer_blocks.3.ff_context.out_projection.weight",]
+
+ class ModelLTXV(ModelTemplate):
+     arch = "ltxv"
+     keys_detect = [
+         (
+             "adaln_single.emb.timestep_embedder.linear_2.weight",
+             "transformer_blocks.27.scale_shift_table",
+             "caption_projection.linear_2.weight",
+         )
+     ]
+
+ class ModelSDXL(ModelTemplate):
+     arch = "sdxl"
+     shape_fix = True
+     keys_detect = [
+         ("down_blocks.0.downsamplers.0.conv.weight", "add_embedding.linear_1.weight",),
+         (
+             "input_blocks.3.0.op.weight", "input_blocks.6.0.op.weight",
+             "output_blocks.2.2.conv.weight", "output_blocks.5.2.conv.weight",
+         ), # Non-diffusers
+         ("label_emb.0.0.weight",),
+     ]
+
+ class ModelSD1(ModelTemplate):
+     arch = "sd1"
+     shape_fix = True
+     keys_detect = [
+         ("down_blocks.0.downsamplers.0.conv.weight",),
+         (
+             "input_blocks.3.0.op.weight", "input_blocks.6.0.op.weight", "input_blocks.9.0.op.weight",
+             "output_blocks.2.1.conv.weight", "output_blocks.5.2.conv.weight", "output_blocks.8.2.conv.weight"
+         ), # Non-diffusers
+     ]
+
+ # The architectures are checked in order and the first successful match terminates the search.
+ arch_list = [ModelFlux, ModelSD3, ModelAura, ModelLTXV, ModelSDXL, ModelSD1]
+
+ def is_model_arch(model, state_dict):
+     # check if model is correct
+     matched = False
+     invalid = False
+     for match_list in model.keys_detect:
+         if all(key in state_dict for key in match_list):
+             matched = True
+             invalid = any(key in state_dict for key in model.keys_banned)
+             break
+     assert not invalid, "Model architecture not allowed for conversion! (i.e. reference VS diffusers format)"
+     return matched
+
+ def detect_arch(state_dict):
+     model_arch = None
+     for arch in arch_list:
+         if is_model_arch(arch, state_dict):
+             model_arch = arch
+             break
+     assert model_arch is not None, "Unknown model architecture!"
+     return model_arch
+
+ def parse_args():
+     parser = argparse.ArgumentParser(description="Generate F16 GGUF files from single UNET")
+     parser.add_argument("--src", required=True, help="Source model ckpt file.")
+     parser.add_argument("--dst", help="Output unet gguf file.")
+     args = parser.parse_args()
+
+     if not os.path.isfile(args.src):
+         parser.error(f"Source file not found: {args.src}")
+
+     return args
+
+ def load_state_dict(path):
+     if any(path.endswith(x) for x in [".ckpt", ".pt", ".bin", ".pth"]):
+         state_dict = torch.load(path, map_location="cpu", weights_only=True)
+         state_dict = state_dict.get("model", state_dict)
+     else:
+         state_dict = load_file(path)
+
+     # only keep unet with no prefix!
+     prefix = None
+     for pfx in ["model.diffusion_model.", "model."]:
+         if any([x.startswith(pfx) for x in state_dict.keys()]):
+             prefix = pfx
+             break
+
+     sd = {}
+     for k, v in state_dict.items():
+         # match the prefix at the start only, so keys such as
+         # "first_stage_model.*" are not mistaken for "model.*"
+         if prefix and not k.startswith(prefix):
+             continue
+         if prefix:
+             k = k[len(prefix):]
+         sd[k] = v
+
+     return sd
+
+ def load_model(path):
+     state_dict = load_state_dict(path)
+     model_arch = detect_arch(state_dict)
+     print(f"* Architecture detected from input: {model_arch.arch}")
+     writer = gguf.GGUFWriter(path=None, arch=model_arch.arch)
+     return (writer, state_dict, model_arch)
+
+ def handle_tensors(args, writer, state_dict, model_arch):
+     name_lengths = tuple(sorted(
+         ((key, len(key)) for key in state_dict.keys()),
+         key=lambda item: item[1],
+         reverse=True,
+     ))
+     if not name_lengths:
+         return
+     max_name_len = name_lengths[0][1]
+     if max_name_len > MAX_TENSOR_NAME_LENGTH:
+         bad_list = ", ".join(f"{key!r} ({namelen})" for key, namelen in name_lengths if namelen > MAX_TENSOR_NAME_LENGTH)
+         raise ValueError(f"Can only handle tensor names up to {MAX_TENSOR_NAME_LENGTH} characters. Tensors exceeding the limit: {bad_list}")
+     for key, data in tqdm(state_dict.items()):
+         old_dtype = data.dtype
+
+         if data.dtype == torch.bfloat16:
+             data = data.to(torch.float32).numpy()
+         # this is so we don't break torch 2.0.X
+         elif data.dtype in [getattr(torch, "float8_e4m3fn", "_invalid"), getattr(torch, "float8_e5m2", "_invalid")]:
+             data = data.to(torch.float16).numpy()
+         else:
+             data = data.numpy()
+
+         n_dims = len(data.shape)
+         data_shape = data.shape
+         data_qtype = getattr(
+             gguf.GGMLQuantizationType,
+             "BF16" if old_dtype == torch.bfloat16 else "F16"
+         )
+
+         # get number of parameters (AKA elements) in this tensor
+         n_params = 1
+         for dim_size in data_shape:
+             n_params *= dim_size
+
+         # keys to keep as max precision
+         blacklist = {
+             "time_embedding.",
+             "add_embedding.",
+             "time_in.",
+             "txt_in.",
+             "vector_in.",
+             "img_in.",
+             "guidance_in.",
+             "final_layer.",
+         }
+
+         if old_dtype in (torch.float32, torch.bfloat16):
+             if n_dims == 1:
+                 # one-dimensional tensors should be kept in F32
+                 # also speeds up inference due to not dequantizing
+                 data_qtype = gguf.GGMLQuantizationType.F32
+
+             elif n_params <= QUANTIZATION_THRESHOLD:
+                 # very small tensors
+                 data_qtype = gguf.GGMLQuantizationType.F32
+
+             elif ".weight" in key and any(x in key for x in blacklist):
+                 data_qtype = gguf.GGMLQuantizationType.F32
+
+         if (model_arch.shape_fix # NEVER reshape for models such as flux
+             and n_dims > 1 # Skip one-dimensional tensors
+             and n_params >= REARRANGE_THRESHOLD # Only rearrange tensors meeting the size requirement
+             and (n_params / 256).is_integer() # Rearranging only makes sense if total elements is divisible by 256
+             and not (data.shape[-1] / 256).is_integer() # Only need to rearrange if the last dimension is not divisible by 256
+         ):
+             orig_shape = data.shape
+             data = data.reshape(n_params // 256, 256)
+             writer.add_array(f"comfy.gguf.orig_shape.{key}", tuple(int(dim) for dim in orig_shape))
+
+         try:
+             data = gguf.quants.quantize(data, data_qtype)
+         except (AttributeError, gguf.QuantError) as e:
+             tqdm.write(f"falling back to F16: {e}")
+             data_qtype = gguf.GGMLQuantizationType.F16
+             data = gguf.quants.quantize(data, data_qtype)
+
+         new_name = key # do we need to rename?
+
+         shape_str = f"{{{', '.join(str(n) for n in reversed(data.shape))}}}"
+         tqdm.write(f"{f'%-{max_name_len + 4}s' % f'{new_name}'} {old_dtype} --> {data_qtype.name}, shape = {shape_str}")
+
+         writer.add_tensor(new_name, data, raw_dtype=data_qtype)
+
+ if __name__ == "__main__":
+     args = parse_args()
+     path = args.src
+     writer, state_dict, model_arch = load_model(path)
+
+     writer.add_quantization_version(gguf.GGML_QUANT_VERSION)
+     if next(iter(state_dict.values())).dtype == torch.bfloat16:
+         out_path = f"{os.path.splitext(path)[0]}-BF16.gguf"
+         writer.add_file_type(gguf.LlamaFileType.MOSTLY_BF16)
+     else:
+         out_path = f"{os.path.splitext(path)[0]}-F16.gguf"
+         writer.add_file_type(gguf.LlamaFileType.MOSTLY_F16)
+
+     out_path = args.dst or out_path
+     if os.path.isfile(out_path):
+         input("Output exists, press enter to continue or ctrl+c to abort!")
+
+     handle_tensors(args, writer, state_dict, model_arch)
+     writer.write_header_to_file(path=out_path)
+     writer.write_kv_data_to_file()
+     writer.write_tensors_to_file(progress=True)
+     writer.close()
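
The rearrangement rule in `handle_tensors` is easier to follow in isolation. The sketch below restates only the shape test, using plain tuples instead of tensors; `rearranged_shape` is a hypothetical helper, and the constants are copied from the script above.

```python
import math

REARRANGE_THRESHOLD = 512  # same value as in convert.py
BLOCK = 256                # row length that convert.py reshapes to


def rearranged_shape(shape, shape_fix=True):
    """Return the (rows, 256) shape convert.py would write, or None if the tensor is left as-is."""
    n_params = math.prod(shape)
    if (shape_fix                             # only SDXL / SD1 style models set this
            and len(shape) > 1                # skip one-dimensional tensors
            and n_params >= REARRANGE_THRESHOLD
            and n_params % BLOCK == 0         # total element count must split into rows of 256
            and shape[-1] % BLOCK != 0):      # last dimension is not already a multiple of 256
        return (n_params // BLOCK, BLOCK)
    return None


if __name__ == "__main__":
    print(rearranged_shape((320, 4, 3, 3)))
    print(rearranged_shape((1280, 1280)))
```

A 3x3 convolution weight such as `(320, 4, 3, 3)` is flattened into rows of 256 elements, while a linear weight such as `(1280, 1280)` is left untouched. The original shape is stored under `comfy.gguf.orig_shape.<key>` so that a loader can reshape the tensor back after dequantizing.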
custom_nodes/ComfyUI-GGUF/tools/fix_lines_ending.py ADDED
@@ -0,0 +1,31 @@
+ import os
+
+ files = ["lcpp.patch", "lcpp_sd3.patch"]
+
+ def has_unix_line_endings(file_path):
+     try:
+         with open(file_path, 'rb') as file:
+             content = file.read()
+             return b'\r\n' not in content
+     except Exception as e:
+         print(f"Error checking '{file_path}': {e}")
+         return False
+
+ def convert_to_linux_format(file_path):
+     try:
+         with open(file_path, 'rb') as file:
+             content = file.read().replace(b'\r\n', b'\n')
+         with open(file_path, 'wb') as file:
+             file.write(content)
+         print(f"'{file_path}' converted to Linux line endings (LF).")
+     except Exception as e:
+         print(f"Error processing '{file_path}': {e}")
+
+ for file in files:
+     if os.path.exists(file):
+         if has_unix_line_endings(file):
+             print(f"'{file}' already has Unix line endings (LF). No conversion needed.")
+         else:
+             convert_to_linux_format(file)
+     else:
+         print(f"File '{file}' does not exist.")
custom_nodes/ComfyUI-GGUF/tools/lcpp.patch ADDED
@@ -0,0 +1,223 @@
1
+ diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
2
+ index 1d2a3540..b1a9ee96 100644
3
+ --- a/ggml/include/ggml.h
4
+ +++ b/ggml/include/ggml.h
5
+ @@ -230,7 +230,7 @@
6
+ #define GGML_MAX_CONTEXTS 64
7
+ #define GGML_MAX_SRC 10
8
+ #ifndef GGML_MAX_NAME
9
+ -#define GGML_MAX_NAME 64
10
+ +#define GGML_MAX_NAME 128
11
+ #endif
12
+ #define GGML_MAX_OP_PARAMS 64
13
+ #define GGML_DEFAULT_N_THREADS 4
14
+ diff --git a/src/llama.cpp b/src/llama.cpp
15
+ index 5ab65ea9..35580d9d 100644
16
+ --- a/src/llama.cpp
17
+ +++ b/src/llama.cpp
18
+ @@ -212,6 +212,9 @@ enum llm_arch {
19
+ LLM_ARCH_JAIS,
20
+ LLM_ARCH_NEMOTRON,
21
+ LLM_ARCH_EXAONE,
22
+ + LLM_ARCH_FLUX,
23
+ + LLM_ARCH_SD1,
24
+ + LLM_ARCH_SDXL,
25
+ LLM_ARCH_UNKNOWN,
26
+ };
27
+
28
+ @@ -259,6 +262,9 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
29
+ { LLM_ARCH_JAIS, "jais" },
30
+ { LLM_ARCH_NEMOTRON, "nemotron" },
31
+ { LLM_ARCH_EXAONE, "exaone" },
32
+ + { LLM_ARCH_FLUX, "flux" },
33
+ + { LLM_ARCH_SD1, "sd1" },
34
+ + { LLM_ARCH_SDXL, "sdxl" },
35
+ { LLM_ARCH_UNKNOWN, "(unknown)" },
36
+ };
37
+
38
+ @@ -1337,6 +1343,9 @@ static const std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NA
39
+ { LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
40
+ },
41
+ },
42
+ + { LLM_ARCH_FLUX, {}},
43
+ + { LLM_ARCH_SD1, {}},
44
+ + { LLM_ARCH_SDXL, {}},
45
+ {
46
+ LLM_ARCH_UNKNOWN,
47
+ {
48
+ @@ -4629,6 +4638,12 @@ static void llm_load_hparams(
49
+ // get general kv
50
+ ml.get_key(LLM_KV_GENERAL_NAME, model.name, false);
51
+
52
+ + // Disable LLM metadata for image models
53
+ + if (model.arch == LLM_ARCH_FLUX || model.arch == LLM_ARCH_SD1 || model.arch == LLM_ARCH_SDXL) {
54
+ + model.ftype = ml.ftype;
55
+ + return;
56
+ + }
57
+ +
58
+ // get hparams kv
59
+ ml.get_key(LLM_KV_VOCAB_SIZE, hparams.n_vocab, false) || ml.get_arr_n(LLM_KV_TOKENIZER_LIST, hparams.n_vocab);
60
+
61
+ @@ -15827,11 +15842,162 @@ static void llama_tensor_dequantize_internal(
62
+ workers.clear();
63
+ }
64
+
65
+ +static ggml_type img_tensor_get_type(quantize_state_internal & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
66
+ + // Special function for quantizing image model tensors
67
+ + const std::string name = ggml_get_name(tensor);
68
+ + const llm_arch arch = qs.model.arch;
69
+ +
70
+ + // Sanity check
71
+ + if (
72
+ + (name.find("model.diffusion_model.") != std::string::npos) ||
73
+ + (name.find("first_stage_model.") != std::string::npos) ||
74
+ + (name.find("single_transformer_blocks.") != std::string::npos)
75
+ + ) {
76
+ + throw std::runtime_error("Invalid input GGUF file. This is not a supported UNET model");
77
+ + }
78
+ +
79
+ + // Unsupported quant types - exclude all IQ quants for now
80
+ + if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
81
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ||
82
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
83
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ1_M || ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL ||
84
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S ||
85
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ3_M || ftype == LLAMA_FTYPE_MOSTLY_Q4_0_4_4 ||
86
+ + ftype == LLAMA_FTYPE_MOSTLY_Q4_0_4_8 || ftype == LLAMA_FTYPE_MOSTLY_Q4_0_8_8) {
87
+ + throw std::runtime_error("Invalid quantization type for image model (Not supported)");
88
+ + }
89
+ +
90
+ + if ( // Tensors to keep in FP32 precision
91
+ + (arch == LLM_ARCH_FLUX) && (
92
+ + (name.find("img_in.") != std::string::npos) ||
93
+ + (name.find("time_in.in_layer.") != std::string::npos) ||
94
+ + (name.find("vector_in.in_layer.") != std::string::npos) ||
95
+ + (name.find("guidance_in.in_layer.") != std::string::npos) ||
96
+ + (name.find("final_layer.linear.") != std::string::npos)
97
+ + ) || (arch == LLM_ARCH_SD1 || arch == LLM_ARCH_SDXL) && (
98
+ + (name.find("conv_in.") != std::string::npos) ||
99
+ + (name.find("conv_out.") != std::string::npos) ||
100
+ + (name == "input_blocks.0.0.weight") ||
101
+ + (name == "out.2.weight")
102
+ + )) {
103
+ + new_type = GGML_TYPE_F32;
104
+ + } else if ( // Tensors to keep in FP16 precision
105
+ + (arch == LLM_ARCH_FLUX) && (
106
+ + (name.find("txt_in.") != std::string::npos) ||
107
+ + (name.find("time_in.") != std::string::npos) ||
108
+ + (name.find("vector_in.") != std::string::npos) ||
109
+ + (name.find("guidance_in.") != std::string::npos) ||
110
+ + (name.find("final_layer.") != std::string::npos)
111
+ + ) || (arch == LLM_ARCH_SD1 || arch == LLM_ARCH_SDXL) && (
112
+ + (name.find("class_embedding.") != std::string::npos) ||
113
+ + (name.find("time_embedding.") != std::string::npos) ||
114
+ + (name.find("add_embedding.") != std::string::npos) ||
115
+ + (name.find("time_embed.") != std::string::npos) ||
116
+ + (name.find("label_emb.") != std::string::npos) ||
117
+ + (name.find("proj_in.") != std::string::npos) ||
118
+ + (name.find("proj_out.") != std::string::npos)
119
+ + // (name.find("conv_shortcut.") != std::string::npos) // marginal improvement
120
+ + )) {
121
+ + new_type = GGML_TYPE_F16;
122
+ + } else if ( // Rules for to_v attention
123
+ + (name.find("attn_v.weight") != std::string::npos) ||
124
+ + (name.find(".to_v.weight") != std::string::npos)
125
+ + ){
126
+ + if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
127
+ + new_type = GGML_TYPE_Q3_K;
128
+ + }
129
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
130
+ + new_type = qs.i_attention_wv < 2 ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
131
+ + }
132
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
133
+ + new_type = GGML_TYPE_Q5_K;
134
+ + }
135
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
136
+ + new_type = GGML_TYPE_Q6_K;
137
+ + }
138
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && qs.i_attention_wv < 4) {
139
+ + new_type = GGML_TYPE_Q5_K;
140
+ + }
141
+ + ++qs.i_attention_wv;
142
+ + } else if ( // Rules for fused qkv attention
143
+ + (name.find("attn_qkv.weight") != std::string::npos) ||
144
+ + (name.find("attn.qkv.weight") != std::string::npos)
145
+ + ) {
146
+ + if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
147
+ + new_type = GGML_TYPE_Q4_K;
148
+ + }
149
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
150
+ + new_type = GGML_TYPE_Q5_K;
151
+ + }
152
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
153
+ + new_type = GGML_TYPE_Q6_K;
154
+ + }
155
+ + } else if ( // Rules for ffn
156
+ + (name.find("ffn_down") != std::string::npos) ||
157
+ + (name.find("DenseReluDense.wo") != std::string::npos)
158
+ + ) {
159
+ + // TODO: add back `layer_info` with some model specific logic + logic further down
160
+ + if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
161
+ + new_type = GGML_TYPE_Q4_K;
162
+ + }
163
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
164
+ + new_type = GGML_TYPE_Q5_K;
165
+ + }
166
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S) {
167
+ + new_type = GGML_TYPE_Q5_K;
168
+ + }
169
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
170
+ + new_type = GGML_TYPE_Q6_K;
171
+ + }
172
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
173
+ + new_type = GGML_TYPE_Q6_K;
174
+ + }
175
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0) {
176
+ + new_type = GGML_TYPE_Q4_1;
177
+ + }
178
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_0) {
179
+ + new_type = GGML_TYPE_Q5_1;
180
+ + }
181
+ + ++qs.i_ffn_down;
182
+ + }
183
+ +
184
+ + // Sanity check for row shape
185
+ + bool convert_incompatible_tensor = false;
186
+ + if (new_type == GGML_TYPE_Q2_K || new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K ||
187
+ + new_type == GGML_TYPE_Q5_K || new_type == GGML_TYPE_Q6_K) {
188
+ + int nx = tensor->ne[0];
189
+ + int ny = tensor->ne[1];
190
+ + if (nx % QK_K != 0) {
191
+ + LLAMA_LOG_WARN("\n\n%s : tensor cols %d x %d are not divisible by %d, required for %s", __func__, nx, ny, QK_K, ggml_type_name(new_type));
192
+ + convert_incompatible_tensor = true;
193
+ + } else {
194
+ + ++qs.n_k_quantized;
195
+ + }
196
+ + }
197
+ + if (convert_incompatible_tensor) {
198
+ + // TODO: Possibly reenable this in the future
199
+ + // switch (new_type) {
200
+ + // case GGML_TYPE_Q2_K:
201
+ + // case GGML_TYPE_Q3_K:
202
+ + // case GGML_TYPE_Q4_K: new_type = GGML_TYPE_Q5_0; break;
203
+ + // case GGML_TYPE_Q5_K: new_type = GGML_TYPE_Q5_1; break;
204
+ + // case GGML_TYPE_Q6_K: new_type = GGML_TYPE_Q8_0; break;
205
+ + // default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
206
+ + // }
207
+ + new_type = GGML_TYPE_F16;
208
+ + LLAMA_LOG_WARN(" - using fallback quantization %s\n", ggml_type_name(new_type));
209
+ + ++qs.n_fallback;
210
+ + }
211
+ + return new_type;
212
+ +}
213
+ +
214
+ +
215
+ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
216
+ const std::string name = ggml_get_name(tensor);
217
+
218
+ // TODO: avoid hardcoded tensor names - use the TN_* constants
219
+ const llm_arch arch = qs.model.arch;
220
+ + if (arch == LLM_ARCH_FLUX || arch == LLM_ARCH_SD1 || arch == LLM_ARCH_SDXL) { return img_tensor_get_type(qs, new_type, tensor, ftype); };
221
+ const auto tn = LLM_TN(arch);
222
+
223
+ auto use_more_bits = [](int i_layer, int n_layers) -> bool {
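
The per-tensor overrides in `img_tensor_get_type` amount to a few lookup tables followed by a row-size check. The Python sketch below restates the attention and ffn rules from the patch above. It is an illustration only: type names are plain strings, the F32/F16 keep-lists for specific layers are omitted, `pick_type` is a hypothetical name, and `QK_K = 256` assumes the default llama.cpp super-block size.

```python
QK_K = 256  # assumed K-quant super-block size (default llama.cpp build)
K_QUANTS = {"Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K"}

# ftype -> override for fused qkv and ffn_down tensors, transcribed from the patch
FUSED_QKV = {"Q3_K_M": "Q4_K", "Q3_K_L": "Q4_K", "Q4_K_M": "Q5_K", "Q5_K_M": "Q6_K"}
FFN_DOWN = {
    "Q3_K_M": "Q4_K", "Q3_K_L": "Q5_K", "Q4_K_S": "Q5_K", "Q4_K_M": "Q6_K",
    "Q5_K_M": "Q6_K", "Q4_0": "Q4_1", "Q5_0": "Q5_1",
}


def attn_v_type(ftype, default, i_attention_wv):
    """Override for value-projection tensors; i_attention_wv counts those seen so far."""
    if ftype == "Q2_K":
        return "Q3_K"
    if ftype == "Q3_K_M":
        return "Q5_K" if i_attention_wv < 2 else "Q4_K"
    if ftype == "Q3_K_L":
        return "Q5_K"
    if ftype in ("Q4_K_M", "Q5_K_M"):
        return "Q6_K"
    if ftype == "Q4_K_S" and i_attention_wv < 4:
        return "Q5_K"
    return default


def pick_type(name, ftype, default, n_cols, i_attention_wv=0):
    """Choose a tensor type by name, then fall back to F16 if a K-quant cannot fit the row."""
    if "attn_v.weight" in name or ".to_v.weight" in name:
        new_type = attn_v_type(ftype, default, i_attention_wv)
    elif "attn_qkv.weight" in name or "attn.qkv.weight" in name:
        new_type = FUSED_QKV.get(ftype, default)
    elif "ffn_down" in name or "DenseReluDense.wo" in name:
        new_type = FFN_DOWN.get(ftype, default)
    else:
        new_type = default
    # K-quants need the row length to be a multiple of QK_K
    if new_type in K_QUANTS and n_cols % QK_K != 0:
        new_type = "F16"
    return new_type


if __name__ == "__main__":
    print(pick_type("double_blocks.0.img_attn.qkv.weight", "Q4_K_M", "Q4_K", 3072))
```

The final check mirrors the patch's fallback: a tensor whose first dimension is not divisible by `QK_K` is stored as F16 rather than being mapped to a legacy quant type.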
custom_nodes/ComfyUI-GGUF/tools/lcpp_sd3.patch ADDED
@@ -0,0 +1,324 @@
1
+ diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
2
+ index de3c706f..0267c1fa 100644
3
+ --- a/ggml/include/ggml.h
4
+ +++ b/ggml/include/ggml.h
5
+ @@ -223,7 +223,7 @@
6
+ #define GGML_MAX_OP_PARAMS 64
7
+
8
+ #ifndef GGML_MAX_NAME
9
+ -# define GGML_MAX_NAME 64
10
+ +# define GGML_MAX_NAME 128
11
+ #endif
12
+
13
+ #define GGML_DEFAULT_N_THREADS 4
14
+ @@ -2449,6 +2449,7 @@ extern "C" {
15
+
16
+ // manage tensor info
17
+ GGML_API void gguf_add_tensor(struct gguf_context * ctx, const struct ggml_tensor * tensor);
18
+ + GGML_API void gguf_set_tensor_ndim(struct gguf_context * ctx, const char * name, int n_dim);
19
+ GGML_API void gguf_set_tensor_type(struct gguf_context * ctx, const char * name, enum ggml_type type);
20
+ GGML_API void gguf_set_tensor_data(struct gguf_context * ctx, const char * name, const void * data, size_t size);
21
+
22
+ diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
23
+ index b16c462f..6d1568f1 100644
24
+ --- a/ggml/src/ggml.c
25
+ +++ b/ggml/src/ggml.c
26
+ @@ -22960,6 +22960,14 @@ void gguf_add_tensor(
27
+ ctx->header.n_tensors++;
28
+ }
29
+
30
+ +void gguf_set_tensor_ndim(struct gguf_context * ctx, const char * name, const int n_dim) {
31
+ + const int idx = gguf_find_tensor(ctx, name);
32
+ + if (idx < 0) {
33
+ + GGML_ABORT("tensor not found");
34
+ + }
35
+ + ctx->infos[idx].n_dims = n_dim;
36
+ +}
37
+ +
38
+ void gguf_set_tensor_type(struct gguf_context * ctx, const char * name, enum ggml_type type) {
39
+ const int idx = gguf_find_tensor(ctx, name);
40
+ if (idx < 0) {
41
+ diff --git a/src/llama.cpp b/src/llama.cpp
42
+ index 24e1f1f0..aeccc173 100644
43
+ --- a/src/llama.cpp
44
+ +++ b/src/llama.cpp
45
+ @@ -205,6 +205,11 @@ enum llm_arch {
46
+ LLM_ARCH_GRANITE,
47
+ LLM_ARCH_GRANITE_MOE,
48
+ LLM_ARCH_CHAMELEON,
49
+ + LLM_ARCH_FLUX,
50
+ + LLM_ARCH_SD1,
51
+ + LLM_ARCH_SDXL,
52
+ + LLM_ARCH_SD3,
53
+ + LLM_ARCH_AURA,
54
+ LLM_ARCH_UNKNOWN,
55
+ };
56
+
57
+ @@ -258,6 +263,11 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
58
+ { LLM_ARCH_GRANITE, "granite" },
59
+ { LLM_ARCH_GRANITE_MOE, "granitemoe" },
60
+ { LLM_ARCH_CHAMELEON, "chameleon" },
61
+ + { LLM_ARCH_FLUX, "flux" },
62
+ + { LLM_ARCH_SD1, "sd1" },
63
+ + { LLM_ARCH_SDXL, "sdxl" },
64
+ + { LLM_ARCH_SD3, "sd3" },
65
+ + { LLM_ARCH_AURA, "aura" },
66
+ { LLM_ARCH_UNKNOWN, "(unknown)" },
67
+ };
68
+
69
+ @@ -1531,6 +1541,11 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
70
+ { LLM_TENSOR_ATTN_K_NORM, "blk.%d.attn_k_norm" },
71
+ },
72
+ },
73
+ + { LLM_ARCH_FLUX, {}},
74
+ + { LLM_ARCH_SD1, {}},
75
+ + { LLM_ARCH_SDXL, {}},
76
+ + { LLM_ARCH_SD3, {}},
77
+ + { LLM_ARCH_AURA, {}},
78
+ {
79
+ LLM_ARCH_UNKNOWN,
80
+ {
81
+ @@ -5403,6 +5418,12 @@ static void llm_load_hparams(
82
+ // get general kv
83
+ ml.get_key(LLM_KV_GENERAL_NAME, model.name, false);
84
+
85
+ + // Disable LLM metadata for image models
86
+ + if (model.arch == LLM_ARCH_FLUX || model.arch == LLM_ARCH_SD1 || model.arch == LLM_ARCH_SDXL || model.arch == LLM_ARCH_SD3 || model.arch == LLM_ARCH_AURA) {
87
+ + model.ftype = ml.ftype;
88
+ + return;
89
+ + }
90
+ +
91
+ // get hparams kv
92
+ ml.get_key(LLM_KV_VOCAB_SIZE, hparams.n_vocab, false) || ml.get_arr_n(LLM_KV_TOKENIZER_LIST, hparams.n_vocab);
93
+
94
+ @@ -18016,6 +18037,125 @@ static void llama_tensor_dequantize_internal(
95
+ workers.clear();
96
+ }
97
+
98
+ +static ggml_type img_tensor_get_type(quantize_state_internal & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
99
+ + // Special function for quantizing image model tensors
100
+ + const std::string name = ggml_get_name(tensor);
101
+ + const llm_arch arch = qs.model.arch;
102
+ +
103
+ + // Sanity check
104
+ + if (
105
+ + (name.find("model.diffusion_model.") != std::string::npos) ||
106
+ + (name.find("first_stage_model.") != std::string::npos) ||
107
+ + (name.find("single_transformer_blocks.") != std::string::npos) ||
108
+ + (name.find("joint_transformer_blocks.") != std::string::npos)
109
+ + ) {
110
+ + throw std::runtime_error("Invalid input GGUF file. This is not a supported UNET model");
111
+ + }
112
+ +
113
+ + // Unsupported quant types - exclude all IQ quants for now
114
+ + if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS ||
115
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ||
116
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ3_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
117
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ1_M || ftype == LLAMA_FTYPE_MOSTLY_IQ4_NL ||
118
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ3_S ||
119
+ + ftype == LLAMA_FTYPE_MOSTLY_IQ3_M || ftype == LLAMA_FTYPE_MOSTLY_Q4_0_4_4 ||
120
+ + ftype == LLAMA_FTYPE_MOSTLY_Q4_0_4_8 || ftype == LLAMA_FTYPE_MOSTLY_Q4_0_8_8) {
121
+ + throw std::runtime_error("Invalid quantization type for image model (Not supported)");
122
+ + }
123
+ +
124
+ + if ( // Rules for to_v attention
125
+ + (name.find("attn_v.weight") != std::string::npos) ||
126
+ + (name.find(".to_v.weight") != std::string::npos) ||
127
+ + (name.find(".attn.w1v.weight") != std::string::npos) ||
128
+ + (name.find(".attn.w2v.weight") != std::string::npos)
129
+ + ){
130
+ + if (ftype == LLAMA_FTYPE_MOSTLY_Q2_K) {
131
+ + new_type = GGML_TYPE_Q3_K;
132
+ + }
133
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
134
+ + new_type = qs.i_attention_wv < 2 ? GGML_TYPE_Q5_K : GGML_TYPE_Q4_K;
135
+ + }
136
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
137
+ + new_type = GGML_TYPE_Q5_K;
138
+ + }
139
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
140
+ + new_type = GGML_TYPE_Q6_K;
141
+ + }
142
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S && qs.i_attention_wv < 4) {
143
+ + new_type = GGML_TYPE_Q5_K;
144
+ + }
145
+ + ++qs.i_attention_wv;
146
+ + } else if ( // Rules for fused qkv attention
147
+ + (name.find("attn_qkv.weight") != std::string::npos) ||
148
+ + (name.find("attn.qkv.weight") != std::string::npos)
149
+ + ) {
150
+ + if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
151
+ + new_type = GGML_TYPE_Q4_K;
152
+ + }
153
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
154
+ + new_type = GGML_TYPE_Q5_K;
155
+ + }
156
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
157
+ + new_type = GGML_TYPE_Q6_K;
158
+ + }
159
+ + } else if ( // Rules for ffn
160
+ + (name.find("ffn_down") != std::string::npos)
161
+ + ) {
162
+ + // TODO: add back `layer_info` with some model specific logic + logic further down
163
+ + if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_M) {
164
+ + new_type = GGML_TYPE_Q4_K;
165
+ + }
166
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q3_K_L) {
167
+ + new_type = GGML_TYPE_Q5_K;
168
+ + }
169
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_S) {
170
+ + new_type = GGML_TYPE_Q5_K;
171
+ + }
172
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
173
+ + new_type = GGML_TYPE_Q6_K;
174
+ + }
175
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
176
+ + new_type = GGML_TYPE_Q6_K;
177
+ + }
178
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0) {
179
+ + new_type = GGML_TYPE_Q4_1;
180
+ + }
181
+ + else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_0) {
182
+ + new_type = GGML_TYPE_Q5_1;
183
+ + }
184
+ + ++qs.i_ffn_down;
185
+ + }
186
+ +
187
+ + // Sanity check for row shape
188
+ + bool convert_incompatible_tensor = false;
189
+ + if (new_type == GGML_TYPE_Q2_K || new_type == GGML_TYPE_Q3_K || new_type == GGML_TYPE_Q4_K ||
190
+ + new_type == GGML_TYPE_Q5_K || new_type == GGML_TYPE_Q6_K) {
191
+ + int nx = tensor->ne[0];
192
+ + int ny = tensor->ne[1];
193
+ + if (nx % QK_K != 0) {
194
+ + LLAMA_LOG_WARN("\n\n%s : tensor cols %d x %d are not divisible by %d, required for %s", __func__, nx, ny, QK_K, ggml_type_name(new_type));
195
+ + convert_incompatible_tensor = true;
196
+ + } else {
197
+ + ++qs.n_k_quantized;
198
+ + }
199
+ + }
200
+ + if (convert_incompatible_tensor) {
201
+ + // TODO: Possibly reenable this in the future
202
+ + // switch (new_type) {
203
+ + // case GGML_TYPE_Q2_K:
204
+ + // case GGML_TYPE_Q3_K:
205
+ + // case GGML_TYPE_Q4_K: new_type = GGML_TYPE_Q5_0; break;
206
+ + // case GGML_TYPE_Q5_K: new_type = GGML_TYPE_Q5_1; break;
207
+ + // case GGML_TYPE_Q6_K: new_type = GGML_TYPE_Q8_0; break;
208
+ + // default: throw std::runtime_error("\nUnsupported tensor size encountered\n");
209
+ + // }
210
+ + new_type = GGML_TYPE_F16;
211
+ + LLAMA_LOG_WARN(" - using fallback quantization %s\n", ggml_type_name(new_type));
212
+ + ++qs.n_fallback;
213
+ + }
214
+ + return new_type;
215
+ +}
216
+ +
217
+ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
218
+ const std::string name = ggml_get_name(tensor);
219
+
220
+ @@ -18547,6 +18687,29 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
221
+ ctx_outs[i_split] = gguf_init_empty();
222
+ }
223
+ gguf_add_tensor(ctx_outs[i_split], tensor);
224
+ + // SD3 pos_embed needs special fix as first dim is 1, which gets truncated here
225
+ + if (model.arch == LLM_ARCH_SD3) {
226
+ + const std::string name = ggml_get_name(tensor);
227
+ + if (name == "pos_embed" && tensor->ne[2] == 1) {
228
+ + const int n_dim = 3;
229
+ + gguf_set_tensor_ndim(ctx_outs[i_split], "pos_embed", n_dim);
230
+ + LLAMA_LOG_INFO("\n%s: Correcting pos_embed shape for SD3: [key:%s]\n", __func__, tensor->name);
231
+ + }
232
+ + }
233
+ + // same goes for auraflow
234
+ + if (model.arch == LLM_ARCH_AURA) {
235
+ + const std::string name = ggml_get_name(tensor);
236
+ + if (name == "positional_encoding" && tensor->ne[2] == 1) {
237
+ + const int n_dim = 3;
238
+ + gguf_set_tensor_ndim(ctx_outs[i_split], "positional_encoding", n_dim);
239
+ + LLAMA_LOG_INFO("\n%s: Correcting positional_encoding shape for AuraFlow: [key:%s]\n", __func__, tensor->name);
240
+ + }
241
+ + if (name == "register_tokens" && tensor->ne[2] == 1) {
242
+ + const int n_dim = 3;
243
+ + gguf_set_tensor_ndim(ctx_outs[i_split], "register_tokens", n_dim);
244
+ + LLAMA_LOG_INFO("\n%s: Correcting register_tokens shape for AuraFlow: [key:%s]\n", __func__, tensor->name);
245
+ + }
246
+ + }
247
+ }
248
+
249
+ // Set split info if needed
250
+ @@ -18647,6 +18810,56 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
+          // do not quantize relative position bias (T5)
+          quantize &= name.find("attn_rel_b.weight") == std::string::npos;
+
+ +        // rules for image models
+ +        bool image_model = false;
+ +        if (model.arch == LLM_ARCH_FLUX) {
+ +            image_model = true;
+ +            quantize &= name.find("txt_in.") == std::string::npos;
+ +            quantize &= name.find("img_in.") == std::string::npos;
+ +            quantize &= name.find("time_in.") == std::string::npos;
+ +            quantize &= name.find("vector_in.") == std::string::npos;
+ +            quantize &= name.find("guidance_in.") == std::string::npos;
+ +            quantize &= name.find("final_layer.") == std::string::npos;
+ +        }
+ +        if (model.arch == LLM_ARCH_SD1 || model.arch == LLM_ARCH_SDXL) {
+ +            image_model = true;
+ +            quantize &= name.find("class_embedding.") == std::string::npos;
+ +            quantize &= name.find("time_embedding.") == std::string::npos;
+ +            quantize &= name.find("add_embedding.") == std::string::npos;
+ +            quantize &= name.find("time_embed.") == std::string::npos;
+ +            quantize &= name.find("label_emb.") == std::string::npos;
+ +            quantize &= name.find("conv_in.") == std::string::npos;
+ +            quantize &= name.find("conv_out.") == std::string::npos;
+ +            quantize &= name != "input_blocks.0.0.weight";
+ +            quantize &= name != "out.2.weight";
+ +        }
+ +        if (model.arch == LLM_ARCH_SD3) {
+ +            image_model = true;
+ +            quantize &= name.find("final_layer.") == std::string::npos;
+ +            quantize &= name.find("time_text_embed.") == std::string::npos;
+ +            quantize &= name.find("context_embedder.") == std::string::npos;
+ +            quantize &= name.find("t_embedder.") == std::string::npos;
+ +            quantize &= name.find("y_embedder.") == std::string::npos;
+ +            quantize &= name.find("x_embedder.") == std::string::npos;
+ +            quantize &= name != "proj_out.weight";
+ +            quantize &= name != "pos_embed";
+ +        }
+ +        if (model.arch == LLM_ARCH_AURA) {
+ +            image_model = true;
+ +            quantize &= name.find("t_embedder.") == std::string::npos;
+ +            quantize &= name.find("init_x_linear.") == std::string::npos;
+ +            quantize &= name != "modF.1.weight";
+ +            quantize &= name != "cond_seq_linear.weight";
+ +            quantize &= name != "final_linear.weight";
+ +            quantize &= name != "final_linear.weight";
+ +            quantize &= name != "positional_encoding";
+ +            quantize &= name != "register_tokens";
+ +        }
+ +        // ignore 3D/4D tensors for image models as the code was never meant to handle these
+ +        if (image_model) {
+ +            quantize &= ggml_n_dims(tensor) == 2;
+ +        }
+ +
+          enum ggml_type new_type;
+          void * new_data;
+          size_t new_size;
+ @@ -18655,6 +18868,9 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
+              new_type = default_type;
+
+              // get more optimal quantization type based on the tensor shape, layer, etc.
+ +            if (image_model) {
+ +                new_type = img_tensor_get_type(qs, new_type, tensor, ftype);
+ +            } else {
+              if (!params->pure && ggml_is_quantized(default_type)) {
+                  new_type = llama_tensor_get_type(qs, new_type, tensor, ftype);
+              }
+ @@ -18664,6 +18880,7 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
+              if (params->output_tensor_type < GGML_TYPE_COUNT && strcmp(tensor->name, "output.weight") == 0) {
+                  new_type = params->output_tensor_type;
+              }
+ +            }
+
+              // If we've decided to quantize to the same type the tensor is already
+              // in then there's nothing to do.
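The name-based rules in the hunk above can be easier to follow outside of the patch context. Below is a minimal Python sketch of the FLUX branch only, written for illustration: `FLUX_SKIP_SUBSTRINGS` and `should_quantize` are hypothetical names that do not exist in llama.cpp or ComfyUI-GGUF, the incoming `quantize` flag stands in for the earlier checks in `llama_model_quantize_internal`, and the tensor names in the demo are examples rather than a verified list.

```python
# Hypothetical mirror of the FLUX rules from the patch; names are illustrative only.
FLUX_SKIP_SUBSTRINGS = (
    "txt_in.",
    "img_in.",
    "time_in.",
    "vector_in.",
    "guidance_in.",
    "final_layer.",
)


def should_quantize(name: str, n_dims: int, quantize: bool = True) -> bool:
    """Return False for tensors the patch would leave in their original type.

    `quantize` is the flag as it stands after the earlier, non-image checks.
    """
    # Each `quantize &= name.find(...) == npos` line clears the flag on a substring hit.
    quantize = quantize and not any(s in name for s in FLUX_SKIP_SUBSTRINGS)
    # Image models only quantize 2D tensors (`ggml_n_dims(tensor) == 2`).
    return quantize and n_dims == 2


if __name__ == "__main__":
    # Example tensor names, assumed for demonstration.
    examples = [
        ("double_blocks.0.img_attn.qkv.weight", 2),
        ("img_in.weight", 2),
        ("some_block.conv.weight", 4),
    ]
    for tensor_name, dims in examples:
        print(f"{tensor_name}: {should_quantize(tensor_name, dims)}")
```

The other architectures follow the same pattern with their own substring and exact-name lists, so a table-driven version keyed by architecture would be a natural extension of this sketch.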
custom_nodes/ComfyUI-GGUF/tools/read_tensors.py ADDED
@@ -0,0 +1,21 @@
+ #!/usr/bin/python3
+ import os
+ import sys
+ import gguf
+
+ def read_tensors(path):
+     reader = gguf.GGUFReader(path)
+     for tensor in reader.tensors:
+         if tensor.tensor_type == gguf.GGMLQuantizationType.F32:
+             continue
+         print(f"{str(tensor.tensor_type):32}: {tensor.name}")
+
+ try:
+     path = sys.argv[1]
+     assert os.path.isfile(path), "Invalid path"
+     print(f"input: {path}")
+ except Exception as e:
+     input(f"failed: {e}")
+ else:
+     read_tensors(path)
+     input()
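`read_tensors.py` prints one line per non-F32 tensor. When checking the result of a quantization run, a per-type count can be quicker to scan. The following is a sketch of that variation, not part of the commit: `summarize_types` and `summarize_gguf` are hypothetical helper names, and the only `gguf` calls used are the ones the script above already relies on (`GGUFReader`, `reader.tensors`, `tensor.name`, `tensor.tensor_type`).

```python
from collections import Counter


def summarize_types(tensors):
    """Count tensors per type label.

    `tensors` is any iterable of (name, type_label) pairs.
    """
    return Counter(type_label for _name, type_label in tensors)


def summarize_gguf(path):
    """Count tensors per quantization type in a GGUF file (needs the `gguf` package)."""
    # Imported lazily so summarize_types stays usable without gguf installed.
    import gguf

    reader = gguf.GGUFReader(path)
    # tensor_type is an enum member; .name yields labels such as "F16".
    return summarize_types((t.name, t.tensor_type.name) for t in reader.tensors)


if __name__ == "__main__":
    import sys

    # Print the most common types first.
    for type_label, count in summarize_gguf(sys.argv[1]).most_common():
        print(f"{type_label:8} {count}")
```

Unlike the original script, this sketch does not skip F32 tensors, so the tensors excluded by the quantization rules show up in the totals as well.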