---
license: apache-2.0
language:
- it
- en
---

# Mistral-7B-v0.1-Italian-RANDOM

<div align="center">

<img src="https://github.com/Andrew-Wyn/images/blob/master/sava/italian_adapt-img.jpg?raw=true" width="400" height="400" style="border-radius:10%" />

</div>

The **Mistral-7B-v0.1-Adapted** collection of large language models (LLMs) is a family of adapted generative models at the 7B scale (text in/text out), derived from **Mistral-7B-v0.1**.

*Mistral-v0.1-Italian-RANDOM* is a continually trained Mistral model, obtained after substituting the original tokenizer.

After adaptation, this model uses the same tokenizer as [Minerva-3B](https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0).
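
As a quick sanity check, you can load both tokenizers and compare them. This is an illustrative sketch, not part of the original release:

```python
from transformers import AutoTokenizer

# Load the adapted model's tokenizer and Minerva-3B's tokenizer
adapted = AutoTokenizer.from_pretrained("SemanticAlignment/Mistral-v0.1-Italian-RANDOM")
minerva = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-3B-base-v1.0")

# The vocabularies should match after adaptation
print(adapted.vocab_size == minerva.vocab_size)
print(adapted.tokenize("Una bella giornata di sole"))
```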

**Model developers:** SapienzaNLP, ISTI-CNR, ILC-CNR

**Model architecture:** Mistral-7B-v0.1-Adapted is an auto-regressive language model that uses an optimized transformer architecture.

## Data used for the adaptation

The **Mistral-7B-v0.1-Adapted** models are trained on a collection of Italian and English data extracted from [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX).
The data are skewed toward Italian, with English making up about one quarter of the mix: the first 9B tokens were taken from the Italian portion of CulturaX and the first 3B tokens from the English portion.
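
A minimal sketch of how such a token budget could be collected by streaming CulturaX with the `datasets` library. The authors' exact extraction pipeline is not specified here, and the `take_token_budget` helper is hypothetical:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-3B-base-v1.0")

def take_token_budget(language: str, budget: int):
    """Stream CulturaX for `language` and yield documents until `budget` tokens are seen."""
    stream = load_dataset("uonlp/CulturaX", language, split="train", streaming=True)
    seen = 0
    for doc in stream:
        yield doc["text"]
        seen += len(tokenizer(doc["text"])["input_ids"])
        if seen >= budget:
            break

# 9B Italian tokens and 3B English tokens, i.e. English is ~1/4 of the total
italian_docs = take_token_budget("it", 9_000_000_000)
english_docs = take_token_budget("en", 3_000_000_000)
```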

## Use with Transformers

You can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the `generate()` function.

Make sure to update your `transformers` installation via `pip install --upgrade transformers`.

```python
import transformers
import torch

model_id = "SemanticAlignment/Mistral-v0.1-Italian-RANDOM"

# Load the model in bfloat16 and place it automatically on the available devices
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

pipeline("Cosa si può fare in una bella giornata di sole?")
```
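
The same inference can also be run with the Auto classes and `generate()` directly, as mentioned above. A minimal sketch; generation parameters such as `max_new_tokens` are illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "SemanticAlignment/Mistral-v0.1-Italian-RANDOM"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Tokenize the prompt, generate a continuation, and decode it
inputs = tokenizer("Cosa si può fare in una bella giornata di sole?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```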

## Citation

If you use any part of this work, please consider citing the paper as follows:

```bibtex
@misc{moroni2025optimizingllmsitalianreducing,
  title={Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation},
  author={Luca Moroni and Giovanni Puccetti and Pere-Lluis Huguet Cabot and Andrei Stefan Bejgu and Edoardo Barba and Alessio Miaschi and Felice Dell'Orletta and Andrea Esuli and Roberto Navigli},
  year={2025},
  eprint={2504.17025},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.17025},
}
```