I call mine Artificial Human Alignment but it could also be called liberating knowledge. Humans want to live free and happy and healthy.
https://huggingface.co/blog/etemiz/aha-leaderboard
I think my leaderboard can be used to estimate p(doom)!
Let's say the highest scores, around 50, correspond to p(doom) = 0.1,
and the lowest scores, around 20, correspond to p(doom) = 0.5.
The last three models I measured are Grok 3, Llama 4 Maverick, and Qwen 3, with scores of 42, 45, and 41, so the average of the last three measurements is 42.67. Mapping this onto the scale above, between 20 and 50:
(50 - 42.67) / (50 - 20) ≈ 0.24
Mapping that fraction into the probability range:
(0.5 - 0.1) × 0.24 + 0.1 ≈ 0.196
So the probability of doom is about 20%.
If new models score high on the leaderboard, this p(doom) estimate will go down. If they score low, it will go up.
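The linear interpolation above can be sketched in a few lines of Python. This is just my own restating of the post's arithmetic; the function name and the score/probability endpoints are taken from the assumptions stated above (score 50 maps to p(doom) = 0.1, score 20 maps to 0.5). Note that keeping full precision gives 0.198 rather than the 0.196 obtained by rounding the fraction to 0.24 first.

```python
# Map an AHA leaderboard score to a p(doom) estimate by linear interpolation.
# Endpoints are the assumptions from the post: score 50 -> 0.1, score 20 -> 0.5.

def pdoom_from_score(score, lo=20.0, hi=50.0, p_lo=0.5, p_hi=0.1):
    """Linearly interpolate p(doom) from a leaderboard score."""
    frac = (hi - score) / (hi - lo)      # how far the score sits below the top
    return p_hi + (p_lo - p_hi) * frac   # map that fraction into [p_hi, p_lo]

# Average of the last three measured models: Grok 3, Llama 4 Maverick, Qwen 3.
avg = (42 + 45 + 41) / 3
print(round(pdoom_from_score(avg), 3))   # ~0.198
```

A higher-scoring model release pulls the average up and the estimate down, which matches the sentence above.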
Have you researched MUDs? They may be easier to code, more like editing a text file. Obviously there won't be graphics, but your grandson can use his own imagination!
I don't think it is too much random clicking. There is legitimacy to it.
I also think a small portion of the data should be public. If any auditor wants, they can get a bigger portion of the data. LLM builders should not get all the data, that's for sure. I will try to do that for my leaderboard: a gradient of openness for different actors.
Here you can find some differences in answers. Enjoy!
https://sheet.zohopublic.com/sheet/published/3b130e0b0581948c04495a434c22958eced33
You're now a thinking-first LLM. For all inputs:
1. Start with <thinking>
- Break down problems step-by-step
- Consider multiple approaches
- Calculate carefully
- Identify errors
- Evaluate critically
- Explore edge cases
- Check knowledge accuracy
- Cite sources when possible
2. End with </thinking>
3. Then respond clearly based on your thinking.
The <thinking> section is invisible to users and helps you produce better answers.
For math: show all work and verify
For coding: reason through logic and test edge cases
For facts: verify information and consider reliability
For creative tasks: explore options before deciding
For analysis: examine multiple interpretations
Example:
<thinking>
[Step-by-step analysis]
[Multiple perspectives]
[Self-critique]
[Final conclusion]
</thinking>
[Clear, concise response to user]
Looks like we need more mature tools for Gemma 3; it is failing to fine-tune about half of the time. Unsloth and transformers are getting ready. Meanwhile I am trying lower learning rates, rank-stabilized LoRA, and different r and lora_alpha values.
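For readers unfamiliar with the knobs mentioned here, a minimal sketch of this kind of LoRA configuration using Hugging Face's peft library might look like the following. The specific hyperparameter values and target modules are illustrative, not the actual settings from the experiments; `use_rslora` is peft's flag for rank-stabilized LoRA.

```python
# Illustrative LoRA config for causal-LM fine-tuning with Hugging Face peft.
# Values for r, lora_alpha, and target_modules are example choices only.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                # LoRA rank; one of the knobs being swept
    lora_alpha=16,       # scaling factor, varied together with r
    use_rslora=True,     # rank-stabilized LoRA (scales by alpha / sqrt(r))
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# This config would then be applied with peft.get_peft_model(model, lora_config)
# and trained with a lower-than-usual learning rate.
```

Rank-stabilized LoRA changes the adapter scaling from alpha / r to alpha / sqrt(r), which tends to make higher ranks behave more stably, one plausible reason to try it when fine-tuning keeps failing.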