lol thanks! I've always wondered why HF posts don't support markdown.
Huang Liang Hsun PRO
lianghsun
AI & ML interests
Founder of Twinkle AI. Focused on applying deep learning in legal and scientific domains, with expertise in NLP and model fine-tuning.
Recent Activity
liked a dataset about 9 hours ago: trendmicro-ailab/Primus-FineWeb
updated a model about 10 hours ago: lianghsun/Llama-3.3-70B-Taiwan-Cyber-Instruct
published a model about 10 hours ago: lianghsun/Llama-3.3-70B-Taiwan-Cyber-Instruct
lianghsun's activity

replied to their post 24 days ago

posted an update 24 days ago
Post
With the arrival of Twinkle April (Twinkle AI's annual open-source celebration held every April), our community is excited to unveil its very first project:
Twinkle Eval (https://github.com/ai-twinkle/Eval), a next-generation evaluation tool led by our contributor @tedslin.
Unlike traditional evaluation tools such as iKala's ievals (https://github.com/ikala-ai/ievals), which can only evaluate language models (LMs) one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time grows with more complex models, traditional tools become increasingly inefficient: evaluating LRMs on the ikala/tmmluplus benchmark, for example, could run for half a day without finishing.
One question we were especially curious about:
Does shuffling the multiple-choice answer order affect model accuracy? 🤔
See "Changing Answer Order Can Decrease MMLU Accuracy" (arXiv:2406.19470).
To address these challenges, Twinkle Eval brings three key innovations to the table:
1️⃣ Parallelized evaluation of samples (see the sketch right after this list)
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness (illustrated further below)
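To make 1️⃣ concrete, here is a minimal sketch of parallel sample evaluation, assuming an OpenAI-compatible chat endpoint; the endpoint URL, model name, and query_model helper are hypothetical placeholders, not Twinkle Eval's actual code:

# Minimal sketch of parallelized sample evaluation (not Twinkle Eval's code).
# Assumes an OpenAI-compatible endpoint; ENDPOINT and MODEL are placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical server
MODEL = "your-model-name"  # placeholder

def query_model(question: str) -> str:
    """Send one benchmark question and return the model's raw answer."""
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "messages": [{"role": "user", "content": question}]},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def evaluate_parallel(questions: list[str], workers: int = 16) -> list[str]:
    """Fan samples out across worker threads instead of one at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(query_model, questions))

Since each request is I/O-bound (the model server does the heavy lifting), a simple thread pool is enough to keep many samples in flight at once.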
After running experiments, we observed that Twinkle Eval can speed up evaluation by up to 15× 🚀🚀. Interestingly, most models scored slightly lower under the multi-round (2️⃣) and randomized-order (3️⃣) settings than their claimed performance, suggesting that further benchmarking is needed.
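And to illustrate 2️⃣ and 3️⃣, a rough sketch of shuffling a multiple-choice item's options while tracking where the gold answer lands; shuffle_choices is a hypothetical helper for illustration, not Twinkle Eval's API:

# Illustrative sketch of randomized answer order plus multi-round testing.
import random

def shuffle_choices(question, choices, gold_idx, seed):
    """Return the shuffled prompt and the gold answer's new letter."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    lines = [question] + [
        f"({chr(ord('A') + pos)}) {choices[src]}" for pos, src in enumerate(order)
    ]
    new_gold = chr(ord("A") + order.index(gold_idx))
    return "\n".join(lines), new_gold

# Multi-round testing: re-shuffle with a fresh seed each round and average
# the accuracy, so no single lucky ordering inflates the score.
prompt, gold = shuffle_choices(
    "What is the capital of Taiwan?",
    ["Kaohsiung", "Taipei", "Taichung", "Tainan"],
    gold_idx=1,
    seed=0,
)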
This framework also comes with additional tunable parameters and detailed logging of LM behavior per question, perfect for those who want to dive deeper. 🔍
If you find Twinkle Eval useful, please ⭐ the project and help spread the word 🤗

posted an update 3 months ago
Post
Let me introduce the work I've done over the past three months: Llama-3.2-Taiwan-3B and Llama-3.2-Taiwan-3B-Instruct, now open-sourced on 🤗 Hugging Face.
lianghsun/Llama-3.2-Taiwan-3B: This model is built on top of meta-llama/Llama-3.2-3B with continual pretraining. The training dataset consists of a mixture of Traditional Chinese and multilingual texts in specific proportions, including 20B tokens of Traditional Chinese text (a sketch of this kind of mixing follows below).
lianghsun/Llama-3.2-Taiwan-3B-Instruct: This is a conversational model fine-tuned on top of the foundation model.
This Llama-3.2-Taiwan open-source project is currently a one-person effort (yes, I did everything, starting from text preparation; so exhausting!). If you're interested, feel free to join the Discord server for discussions.
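For readers curious what "mixing in specific proportions" can look like in practice, here is a hypothetical sketch using the datasets library; the corpus names and the 0.6/0.4 ratio are placeholders, not the project's actual recipe:

# Hypothetical sketch of proportional corpus mixing for continual pretraining.
# Dataset names and probabilities are placeholders, not the actual recipe.
from datasets import interleave_datasets, load_dataset

zh_tw = load_dataset("org/traditional-chinese-corpus", split="train", streaming=True)  # placeholder
multi = load_dataset("org/multilingual-corpus", split="train", streaming=True)  # placeholder

# Sample ~60% Traditional Chinese and ~40% multilingual examples.
mixed = interleave_datasets([zh_tw, multi], probabilities=[0.6, 0.4], seed=42)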
Benchmarking
The evaluation was conducted using ikala/tmmluplus, though the README does not yet reflect the latest results. Performance is close to that of previous versions, suggesting that further improvements may require adding more specialized knowledge to the datasets.
A call for support
If anyone is willing to provide compute resources, it would be greatly appreciated; that would help this project continue and grow. 💪
---
Foundation model: lianghsun/Llama-3.2-Taiwan-3B
Instruction model: lianghsun/Llama-3.2-Taiwan-3B-Instruct
GGUF: lianghsun/Llama-3.2-Taiwan-3B-Instruct-GGUF
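For anyone who wants to try the instruction model, here is a minimal transformers snippet; it assumes the tokenizer ships a chat template, and settings like max_new_tokens are illustrative defaults rather than recommended values:

# Minimal sketch: chat with the instruct model via transformers.
# Assumes the tokenizer provides a chat template; device_map needs accelerate.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lianghsun/Llama-3.2-Taiwan-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "台灣最高的山是哪一座?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))

The GGUF build can likewise be run locally with llama.cpp-compatible runtimes.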