Wolfram Ravenwolf
wolfram
AI & ML interests
Local LLMs
Recent Activity
Posted an update about 16 hours ago
Finally finished my extensive **Qwen 3 evaluations** across a range of formats and quantisations, focusing on **MMLU-Pro** (Computer Science).
A few take-aways stood out - especially for those interested in local deployment and performance trade-offs:
1️⃣ **Qwen3-235B-A22B** (via Fireworks API) tops the table at **83.66%** with ~55 tok/s.
2️⃣ But the **30B-A3B Unsloth** quant delivered **82.20%** while running locally at ~45 tok/s and with zero API spend.
3️⃣ The same Unsloth build is ~5x faster than Qwen's **Qwen3-32B**, which also scores **82.20%** yet crawls at <10 tok/s (ratios recomputed in the sketch after this list).
4️⃣ On Apple silicon, the **30B MLX** port hits **79.51%** while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups.
5️⃣ The **0.6B** micro-model races above 180 tok/s but tops out at **37.56%** - that's why it's not even on the graph (50% performance cut-off).
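To make the trade-offs concrete, here's a minimal Python sketch that tabulates the scores above and recomputes the headline ratios. All numbers are the measurements from this post; the 9 tok/s figure for the dense 32B is an assumed stand-in for "<10 tok/s".

```python
# Minimal sketch: recompute the headline ratios from the MMLU-Pro (CS) runs.
# Scores/speeds are from this post; 9 tok/s is assumed from "<10 tok/s".
results = [
    # (model, MMLU-Pro CS score %, tok/s, runs locally?)
    ("Qwen3-235B-A22B (Fireworks API)", 83.66,  55, False),
    ("Qwen3-30B-A3B (Unsloth quant)",   82.20,  45, True),
    ("Qwen3-32B (dense)",               82.20,   9, True),
    ("Qwen3-30B-A3B (MLX)",             79.51,  64, True),
    ("Qwen3-0.6B",                      37.56, 180, True),
]

frontier = max(score for _, score, _, _ in results)  # 83.66
for name, score, tps, local in results:
    rel = 100 * score / frontier
    where = "local" if local else "API"
    print(f"{name:33} {score:6.2f}%  {tps:3d} tok/s  {rel:5.1f}% of frontier  [{where}]")

# 30B-A3B quant: 82.20 / 83.66 ≈ 98.3% of frontier accuracy,
# at 45 / 9 = 5x the throughput of the dense 32B.
```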
All local runs were done with LM Studio on an M4 MacBook Pro, using Qwen's official recommended settings.
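If you want to reproduce a run: LM Studio exposes an OpenAI-compatible server (default http://localhost:1234/v1), so each MMLU-Pro question can be sent from a few lines of Python. Here's a minimal sketch - the model identifier is an assumption (check what `lms ls` reports on your machine), and the sampling values are my reading of Qwen's recommended thinking-mode settings, so verify them against the model card.

```python
# Minimal sketch: score one MMLU-Pro-style question against a local
# LM Studio server via its OpenAI-compatible endpoint.
from openai import OpenAI

# LM Studio's default local endpoint; the api_key can be any string.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

question = (
    "Which data structure gives O(1) average-case lookup by key?\n"
    "(A) linked list (B) hash table (C) binary heap (D) B-tree\n"
    "Answer with the letter only."
)

resp = client.chat.completions.create(
    model="qwen3-30b-a3b",  # assumed identifier - check `lms ls`
    messages=[{"role": "user", "content": question}],
    temperature=0.6,        # Qwen's recommended thinking-mode settings,
    top_p=0.95,             # as I recall them - verify on the model card
    max_tokens=2048,        # leave headroom for the <think> block
)
print(resp.choices[0].message.content)
```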
**Conclusion:** Quantised 30B models now get you ~98% of frontier-class accuracy - at a fraction of the latency, cost, and energy. For most local RAG or agent workloads, they're not just good enough - they're the new default.
Well done, Qwen - you really whipped the llama's ass! And to OpenAI: for your upcoming open model, please make it MoE, with toggleable reasoning, and release it in many sizes. *This* is the future!
New activity (about 16 hours ago) on mlx-community/Qwen3-30B-A3B-4bit: Jinja chat template error on lmstudio
New activity (8 days ago) on wolfram/Athene-V2-Chat-4.65bpw-h6-exl2: Improve language tag
wolfram's activity
Jinja chat template error on lmstudio (8) · #1 opened 9 days ago by furyzhenxi
Improve language tag (1) · #1 opened 10 days ago by lbourdois
[Support] Community Articles (1 / 83) · #5 opened about 1 year ago by victor
The tokenizer has changed just fyi (12) · #2 opened 10 months ago by bullerwins
no system message? (9) · #14 opened 12 months ago by mclassHF2023
Concerns regarding Prompt Format (6) · #1 opened about 1 year ago by wolfram
Strange observation: model becomes super horny in ST's MinP mode (5) · #7 opened about 1 year ago by deleted
Upload folder using huggingface_hub (2) · #3 opened about 1 year ago by wolfram
VRAM Estimates (6) · #3 opened about 1 year ago by ernestr
Merge method (1) · #4 opened about 1 year ago by dnhkng
Can't wait to test (5) · #4 opened about 1 year ago by froggeric
Kindly asking for quants (7) · #2 opened about 1 year ago by wolfram
GPTQ / AWQ (1) · #2 opened about 1 year ago by agahebr
Guidance on GPU VRAM Split? (5) · #3 opened over 1 year ago by nmitchko
Upload folder using huggingface_hub (2) · #1 opened about 1 year ago by wolfram
Very interesting that miqu will give 16k context work even only first layer and last layer (1 / 15) · #2 opened about 1 year ago by akoyaki
iMatrix, IQ2_XS & IQ2_XXS (13) · #2 opened about 1 year ago by Nexesenex
Performance (13) · #2 opened over 1 year ago by KnutJaegersberg
VRAM requirements (9) · #1 opened about 1 year ago by sophosympatheia
benchmarks? (1) · #1 opened over 1 year ago by distantquant