RefalMachine committed (verified)
Commit 11e2fd3 · 1 Parent(s): 89ef999

Update README.md

Files changed (1): README.md (+2, -2)
README.md CHANGED
@@ -8,7 +8,7 @@ pinned: false
 ---
 
 ## Description
-**Ruadapt** is a project focused on developing a methodology for adapting large language models (LLMs) to the Russian language, with a change in tokenization to enhance model efficiency.
+**Ruadapt** is a project focused on developing a methodology for adapting large language models (LLMs) to the Russian language, with a change in tokenization to enhance model efficiency. It is important to note that the methodology is **applicable to practically any language**, as it does not employ any language-dependent methods.
 
 In addition to developing the methodology itself, we also employ it to adapt existing SOTA open-source models and make them publicly available. For example, our RuadaptQwen2.5 series of models generates Russian-language text 30-60% faster (in terms of characters) due to more suitable tokenization, while minimizing quality loss in both English and Russian.
 
@@ -16,7 +16,7 @@ One of the unique features of our approach to adaptation lies in the fact that,
 
 An intriguing aspect of adapting T-pro-it-1.0 is that this model was obtained through continuous pretraining on over 100 billion tokens of Russian-language data using full fine-tuning. Despite this extensive prior training, our methodology still worked effectively (note: it was the original base model, Qwen2.5-32B, that was adapted!), and the resulting adapted version either outperformed or matched T-pro-it-1.0 on several benchmarks. Moreover, it demonstrated higher efficiency in Russian-language tokenization.
 
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/652cedbdf120598322ae358a/L-jQw1MjhdAUbkqVfcrt-.png)
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/652cedbdf120598322ae358a/L-jQw1MjhdAUbkqVfcrt-.png){ width=50% }
 
 ## Papers
 Tikhomirov M., Chernyshov D. Facilitating Large Language Model Russian Adaptation with Learned Embedding Propagation // Journal of Language and Education. – 2024. – Vol. 10. – No. 4. – P. 130-145. (Preprint: https://arxiv.org/abs/2412.21140)
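The efficiency claim in the README (30-60% faster generation in terms of characters) boils down to how many characters each tokenizer packs into a token. A minimal sketch of how one might check this with Hugging Face `transformers` is below; the Qwen repo id is the public `Qwen/Qwen2.5-32B-Instruct`, while the Ruadapt repo id is illustrative only and should be replaced with the actual model card name.

```python
# Sketch: compare characters-per-token of the original and adapted tokenizers
# on the same Russian text. The Ruadapt repo id below is an assumption --
# substitute the real model id from the Ruadapt model cards.
from transformers import AutoTokenizer

# Sample Russian sentence (roughly: "Adapting large language models to Russian
# with a change of tokenization.")
text = "Адаптация больших языковых моделей к русскому языку с заменой токенизации."

model_ids = [
    "Qwen/Qwen2.5-32B-Instruct",                 # original tokenizer
    "RefalMachine/RuadaptQwen2.5-32B-instruct",  # adapted tokenizer (illustrative id)
]

for model_id in model_ids:
    tok = AutoTokenizer.from_pretrained(model_id)
    n_tokens = len(tok.encode(text, add_special_tokens=False))
    print(f"{model_id}: {n_tokens} tokens, {len(text) / n_tokens:.2f} chars/token")
```

A higher chars/token ratio for the adapted tokenizer means fewer decoding steps are needed to produce the same Russian text, which is where the reported speedup comes from.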