13 6 18

Emin Temiz PRO

etemiz

https://pickabrain.ai

AI & ML interests

Alignment

Recent Activity

replied to clem's post about 17 hours ago

What are you using to evaluate models or AI systems? So far we're building lighteval & leaderboards on the hub but still feels early & a lot more to build. What would be useful to you?

replied to their post 6 days ago

Qwen 3 numbers are in! They did a good job this time, compared to 2.5 and QwQ numbers are a lot better. I used 2 GGUFs for this, one from LMStudio and one from Unsloth. Number of parameters: 235B A22B. The first one is Q4. Second one is Q8. The LLMs that did the comparison are the same, Llama 3.1 70B and Gemma 3 27B. So I took 2*2 = 4 measurements for each column and took average of measurements. My leaderboard is pretty unrelated to others it seems. Valuable in that sense, it is another non-mainstream angle for model evaluation. More info: https://huggingface.co/blog/etemiz/aha-leaderboard

posted an update 6 days ago

View all activity

Organizations

None yet

etemiz's activity

replied to clem's post about 17 hours ago

I call mine Artificial Human Alignment but it could also be called liberating knowledge. Humans want to live free and happy and healthy.

https://huggingface.co/blog/etemiz/aha-leaderboard

replied to their post 6 days ago

I think my leaderboard can be used for p(doom)!

Lets say highest scores around 50 corresponds to p(doom) = 0.1
And say lowest scores around 20 corresponds to p(doom) = 0.5

Last three models that I measured are Grok 3, Llama 4 Maverick and Qwen 3. Scores are 42, 45, 41. So based on last 3 measurements average is 42.66. Mapping this to the scale above between 20 and 50:

(50-42.66)/(50-20)=0.24

mapping this to the probability domain:

(0.5-0.1)*0.24 + 0.1=0.196

So probability of doom is ~20%

If models are released that score high in my leaderboard, p(doom) will reduce. If models are released that score low in my leaderboard, p(doom) will increase.

posted an update 6 days ago

Post

1055

Qwen 3 numbers are in! They did a good job this time, compared to 2.5 and QwQ numbers are a lot better.

I used 2 GGUFs for this, one from LMStudio and one from Unsloth. Number of parameters: 235B A22B. The first one is Q4. Second one is Q8.

The LLMs that did the comparison are the same, Llama 3.1 70B and Gemma 3 27B.

So I took 2*2 = 4 measurements for each column and took average of measurements.

My leaderboard is pretty unrelated to others it seems. Valuable in that sense, it is another non-mainstream angle for model evaluation.

More info: https://huggingface.co/blog/etemiz/aha-leaderboard

1 reply

reacted to Kseniase's post with ❤️ 9 days ago

Post

6408

6 Free resources on Reinforcement Learning (RL)

RL now is where the real action is, it's the engine behind autonomous tech, robots, and the next wave of AI that thinks, moves and solves problems on its own. To stay up to date with what’s happening in RL, we offer some fresh materials on it:

1. "Reinforcement Learning from Human Feedback" by Nathan Lambert -> https://rlhfbook.com/
It's a short introduction to RLHF, explaining instruction tuning, reward modeling, alignment methods, synthetic data, evaluation, and more

2. "A Course in Reinforcement Learning (2nd Edition)" by Dimitri P. Bertsekas -> https://www.mit.edu/~dimitrib/RLbook.html
Explains dynamic programming (DP) and RL, diving into rollout algorithms, neural networks, policy learning, etc. It’s packed with solved exercises and real-world examples

3. "Mathematical Foundations of Reinforcement Learning" video course by Shiyu Zhao -> https://www.youtube.com/playlist?list=PLEhdbSEZZbDaFWPX4gehhwB9vJZJ1DNm8
Offers a mathematical yet friendly introduction to RL, covering Bellman Equation, value iteration, Monte Carlo learning, approximation, policy gradient, actor-critic methods, etc.
+ Check out the repo for more: https://github.com/MathFoundationRL/Book-Mathematical-Foundation-of-Reinforcement-Learning

4. "Multi-Agent Reinforcement Learning" by Stefano V. Albrecht, Filippos Christianos, and Lukas Schäfer -> https://www.marl-book.com/
Covers models, core ideas of multi-agent RL (MARL) and modern approaches to combining it with deep learning

5. "Reinforcement Learning: A Comprehensive Overview" by Kevin P. Murphy -> https://arxiv.org/pdf/2412.05265
Explains RL and sequential decision making, covering value-based, policy-gradient, model-based, multi-agent RL methods, RL+LLMs, and RL+inference and other topics

6. Our collection of free courses and books on RL -> https://huggingface.co/posts/Kseniase/884818121094439

If you liked this, also subscribe to The Turing Post: https://www.turingpost.com/subscribe

posted an update 15 days ago

Post

555

According to the paper below, when you fine tune a model with harmful code, it turns evil in other areas.
https://arxiv.org/abs/2502.17424

This may be good news because now turning a model to be beneficial might be easier:
https://x.com/ESYudkowsky/status/1894453376215388644

Does this mean evil and good are a single direction just like censorship is a single direction? So in theory one can make a model good doing an abliteration like operation?

1 reply

posted an update 16 days ago

Post

2269

Llama 4 Maverick got worse scores than Llama 3.1 405B in human alignment.

I used CPU for inferencing from this size of a model (402B), and it ran fast. Being a mixture of experts it may be useful for CPU inference and having a big context useful for RAG. For beneficial answers there are other alternatives.

Still it managed to beat Grok 3. I had so much expectations for Grok 3 because X is holding more beneficial ideas in my opinion.

It got worse health scores compared to 3.1 and better bitcoin scores. I could post some comparisons of answers between the two. With which model should I publish comparisons? Llama 3.1 or Grok 3 or something else?

https://sheet.zohopublic.com/sheet/published/mz41j09cc640a29ba47729fed784a263c1d08

posted an update 22 days ago

Post

1612

Grok 3 Human Alignment Score: 42

It is better in health, nutrition, fasting compared to Grok 2. About the same in liberating tech like bitcoin and nostr. Worse in the misinformation and faith domains. The rest is about the same. So we have a model that is less faithful but knows how to live a healthier life.

https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08?sheetid=0&range=A1

https://huggingface.co/blog/etemiz/benchmarking-ai-human-alignment-of-grok-3

replied to Dragunflie-420's post 25 days ago

Have you researched MUDs? It may be easier to code, like doing modifications to a text file. Obviously it won't have graphics but your grandson may use his own imagination!

replied to their post 25 days ago

https://www.reddit.com/r/LocalLLaMA/comments/1jufqbn/qwen3_pull_request_sent_to_llamacpp/

replied to their post 26 days ago

I don't think it is too much random clicking. There is legitimacy to it.

I also think small portion of the data should be public. If any auditor wants, they can get a bigger portion of the data. LLM builders should not get all the data, thats for sure. I will try to do that for my leaderboard, a gradient of openness for different actors.

posted an update 27 days ago

Post

2175

It looks like Llama 4 team gamed the LMArena benchmarks by making their Maverick model output emojis, longer responses and ultra high enthusiasm! Is that ethical or not? They could certainly do a better job by working with teams like llama.cpp, just like Qwen team did with Qwen 3 before releasing the model.

In 2024 I started playing with LLMs just before the release of Llama 3. I think Meta contributed a lot to this field and still contributing. Most LLM fine tuning tools are based on their models and also the inference tool llama.cpp has their name on it. The Llama 4 is fast and maybe not the greatest in real performance but still deserves respect. But my enthusiasm towards Llama models is probably because they rank highest on my AHA Leaderboard:

https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08

Looks like they did a worse job compared to Llama 3.1 this time. Llama 3.1 has been on top for a while.

Ranking high on my leaderboard is not correlated to technological progress or parameter size. In fact if LLM training is getting away from human alignment thanks to synthetic datasets or something else (?), it could be easily inversely correlated to technological progress. It seems there is a correlation regarding the location of the builders (in the West or East). Western models are ranking higher. This has become more visible as the leaderboard progressed, in the past there was less correlation. And Europeans seem to be in the middle!

Whether you like positive vibes from AI or not, maybe the times are getting closer where humans may be susceptible to being gamed by an AI? What do you think?

4 replies

posted an update 29 days ago

Post

579

Initial AHA benchmark of Llama 4 Scout puts it in between Command R+ 1 and DeepSeek V3 0324. More numbers later when I do finer benchmark with more updated inference engines.

posted an update about 1 month ago

Post

1667

Made a new leaderboard where we measure AI—Human alignment

https://huggingface.co/blog/etemiz/aha-leaderboard

reacted to danielhanchen's post with ❤️ about 1 month ago

Post

3461

You can now run DeepSeek-V3-0324 on your own local device!
Run our Dynamic 2.42 and 2.71-bit DeepSeek GGUFs: unsloth/DeepSeek-V3-0324-GGUF

You can run them on llama.cpp and other inference engines. See our guide here: https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-v3-0324-locally

replied to their post about 1 month ago

Here you can find some differences in answers. Enjoy!

https://sheet.zohopublic.com/sheet/published/3b130e0b0581948c04495a434c22958eced33

reacted to samihalawa's post with 👍 about 1 month ago

Post

3454

🧠 PROMPT FOR CONVERTING ANY MODEL IN REASONING "THINKING" MODEL🔥🤖
Convert any model to Deepseek R1 like "thinking" model. 💭

You're now a thinking-first LLM. For all inputs:

1. Start with <thinking>
   - Break down problems step-by-step
   - Consider multiple approaches
   - Calculate carefully
   - Identify errors
   - Evaluate critically
   - Explore edge cases
   - Check knowledge accuracy
   - Cite sources when possible

2. End with </thinking>

3. Then respond clearly based on your thinking.

The <thinking> section is invisible to users and helps you produce better answers.

For math: show all work and verify
For coding: reason through logic and test edge cases
For facts: verify information and consider reliability
For creative tasks: explore options before deciding
For analysis: examine multiple interpretations

Example:
<thinking>
[Step-by-step analysis]
[Multiple perspectives]
[Self-critique]
[Final conclusion]
</thinking>

[Clear, concise response to user]

4 replies

posted an update about 1 month ago

Post

1946

Latest DeepSeek V3 0324 did better than previous version in many domains such as health, nutrition, fasting, bitcoin.

Who wants to see some example change of answers between the two models?

https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08

2 replies

posted an update about 1 month ago

Post

493

Mistral Small 3.1 numbers are in. It is interesting Mistral always lands in the middle.
https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08?sheetid=0&range=A1

I started to do the comparison with 2 models now. In the past Llama 3.1 70B Q4 was the one doing the comparison of answers. Now I am using Gemma 3 27B Q8 as well to have a second opinion on it. Gemma 3 produces very similar measurement to Llama 3.1. So the end result is not going to shake much.

1 reply

replied to their post about 2 months ago

Looks like we need more mature tools for Gemma 3, it is failing to fine tune like half of the time. Unsloth and transformers are getting ready. And I am trying lower learning rates and rank stabilized LoRa, and different r, lora_alpha.

reacted to their post with 🚀 about 2 months ago

Post

1709

Started fine tuning Gemma 3 using evolutionary approach. It is not the worst model according to AHA leaderboard and it is one of the smart according to lmarena.ai. My objective is to make it based, anti woke, wise, beneficial and then some.

Several GPUs are fine tuning it at the same time, each using a different dataset and using QLoRA and the successful ones are merged later. Compared to LoRa this allows faster training and also reduced overfitting because the merge operation heals overfitting. The problem with this could be the 4 bit quantization may make models dumber. But I am not looking for sheer IQ. Too much mind is a problem anyway :)

Has anyone tried parallel QLoRa and merge before?

I also automated the dataset selection and benchmarking and converging to objectives (the fit function, the reward). It is basically trying to get higher score in AHA Leaderboard as fast as possible with a diverse set of organisms that "evolve by training".

I want to release some cool stuff when I have the time:
- how an answer to a single question changes over time, with each training round or day
- a chart to show AHA alignment over training rounds

3 replies