---
library_name: transformers
datasets:
- bigcode/the-stack-v2
- modularStarEncoder/SynthCode2Code2NL-neardedup
license: bigcode-openrail-m
base_model:
- modularStarEncoder/ModularStarEncoder
---

# ModularStarEncoder-1B Fine-Tuned model

ModularStarEncoder-finetuned is an encoder obtained by fine-tuning [ModularStarEncoder-1B Pre-trained](https://huggingface.co/andreagurioli1995/ModularStarEncoder) on [SynthCode2Code2NL](https://huggingface.co/datasets/andreagurioli1995/SynthCode2Code2NL-neardedup).
It targets code-to-code and text-to-code retrieval and exposes several exit points, letting the end user select the model size that meets their memory and computational constraints.
We built ModularStarEncoder on top of [StarCoder-2](https://huggingface.co/bigcode/starcoder2-15b), reducing its size from 15B to 1B parameters in bfloat16.

The model is fine-tuned with a [CLIP objective](https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/loss.py).
The fine-tuned model works with instruction prompts: to get the most out of it, embed the task instruction in the input, as shown in the "How to use" section below.

- **Paper:** [One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings](https://arxiv.org/abs/2503.03008)
- **Languages:** English, Go, Ruby, Python, Java, C++, PHP, C, JavaScript
- **Different sizes:**  [Layer 4](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned-4), [Layer 9](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned-9), [Layer 18](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned-18), [Layer 27](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned-27), [Layer 36](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned)  
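
The linked variants are separate checkpoints, one per exit layer. Assuming they expose the same interface as this model (an assumption, not verified here), a smaller variant can be loaded in the same way; usage then follows the "How to use" section below.

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical example: load the layer-18 variant instead of the full 36-layer model
model = AutoModel.from_pretrained("modularStarEncoder/ModularStarEncoder-finetuned-18", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("modularStarEncoder/ModularStarEncoder-finetuned-18")
```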

### How to use
```python
from transformers import AutoModel, AutoTokenizer

# Load the model (the repository's custom code defines the multi-exit architecture)
model = AutoModel.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned", trust_remote_code=True)

# Load the tokenizer; note that it applies LEFT padding
tokenizer = AutoTokenizer.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned")

language = "yourlanguagelowercased"

# Instruction for embedding a code snippet in a programming language
instruction_code = f"Represent this {language} code snippet for retrieval:"

# Instruction for embedding a code description in English
instruction_natural_language = "Represent this code description for retrieving supporting snippets of code:"

code_snippet = "your code to embed here"

# Follow this pattern to embed a snippet of code or a natural-language query
sentence = f"{tokenizer.sep_token}{instruction_code}{tokenizer.sep_token}{code_snippet}{tokenizer.cls_token}"

# Tokenize the sentence
tokenized_sentence = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=2048)

# Embed the tokenized sentence
embedded_sentence = model(**tokenized_sentence)
```

The output contains three elements:

- projected_pooled_normalized: a list of the projected, pooled, and normalized embeddings from the five exit points (layers [4, 9, 18, 27, 36]; the last element of the list is the projected representation of the final layer);
- raw_hidden_states: the raw hidden states of the model, without pooling, normalization, or projection;
- attentions: the attention scores of the encoder.
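
As a usage sketch (not an official example), the final-layer projected embedding can be used directly for retrieval: the embeddings are already normalized, so a dot product gives the cosine similarity. This assumes the output exposes `projected_pooled_normalized` as an attribute, as listed above, and reuses `model`, `tokenizer`, `instruction_code`, `instruction_natural_language`, and `code_snippet` from the "How to use" snippet.

```python
import torch

# Hypothetical retrieval example: embed a natural-language query and a code snippet,
# then score the pair by cosine similarity.
query = "binary search over a sorted list"  # illustrative free-text query
query_sentence = f"{tokenizer.sep_token}{instruction_natural_language}{tokenizer.sep_token}{query}{tokenizer.cls_token}"
code_sentence = f"{tokenizer.sep_token}{instruction_code}{tokenizer.sep_token}{code_snippet}{tokenizer.cls_token}"

with torch.no_grad():
    query_out = model(**tokenizer(query_sentence, return_tensors="pt", truncation=True, max_length=2048))
    code_out = model(**tokenizer(code_sentence, return_tensors="pt", truncation=True, max_length=2048))

# Last element of the list = projected representation of the final layer (layer 36)
query_emb = query_out.projected_pooled_normalized[-1]
code_emb = code_out.projected_pooled_normalized[-1]

# Embeddings are normalized, so the dot product equals the cosine similarity
similarity = (query_emb @ code_emb.T).item()
print(similarity)
```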
  
  
### Training

We fine-tuned ModularStarEncoder with a batch size of 2,048 contrastive samples for 20,000 training steps.
Pre-training and fine-tuning were conducted on 512 NVIDIA Ampere (64 GB) GPUs on the [Leonardo](https://arxiv.org/abs/2307.16885) supercomputer, requiring a total of 450,000 GPU hours.

| Hyperparameter           | Value     |
|--------------------------|-----------|
| Hidden size              | 1024      |
| Max. position embeddings | 2048      |
| Num. of attention heads  | 12        |
| Num. of key values heads | 4         |
| Num. of hidden layers    | 36        |
| Attention                | GQA       |
| Num. of parameters       | ≈1B       |
| Loss function            | CLIP loss |
| Multi-layer loss         | yes       |
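
For intuition, the sketch below shows a minimal CLIP-style symmetric contrastive loss over a batch of paired code/text embeddings. It is a simplified stand-in, not the authors' training code; see the linked open_clip loss and the paper for the exact (multi-layer) formulation.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(code_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss; matching (code, text) pairs share the same row index."""
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = code_emb @ text_emb.T / temperature              # (batch, batch) similarity matrix
    targets = torch.arange(code_emb.size(0), device=code_emb.device)
    # Cross-entropy in both directions: code -> text and text -> code
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```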


### Evaluation

Below we briefly report CodeSearchNet (CodeXGLUE) results across the different exit layers; for full text-to-code and code-to-code results, refer to the paper:

| Layer           | Avg. MRR     |
|--------------------------|-----------|
| [Layer 4](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned-4)*              | 73.2     |
| [Layer 9](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned-9)*              |    77.3  |
| [Layer 18](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned-18)*              |  81.0    |
| [Layer 27](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned-27)*            |   80.3   |
| [Layer 36](https://huggingface.co/modularStarEncoder/ModularStarEncoder-finetuned)*              |   79.6   |

- (*) The starred sizes and their corresponding projection heads are present in this model.
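
For reference, the reported metric is Mean Reciprocal Rank (MRR): the average over queries of 1/rank of the first correct result. A minimal illustration (not the CodeXGLUE evaluation script):

```python
def mean_reciprocal_rank(ranks: list[int]) -> float:
    """ranks[i] is the 1-based rank of the correct snippet for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: correct snippets ranked 1st, 3rd, and 2nd for three queries
print(mean_reciprocal_rank([1, 3, 2]))  # ≈ 0.611
```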

## Licence 
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).


# Citation
```
@article{gurioli2025modeltrainallhierarchical,
      title={One Model to Train them All: Hierarchical Self-Distillation for Enhanced Early Layer Embeddings}, 
      author={Andrea Gurioli and Federico Pennino and João Monteiro and Maurizio Gabbrielli},
      year={2025},
      eprint={2503.03008},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.03008}, 
}
```