---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: apache-2.0
base_model:
- deepvk/RuModernBERT-base
datasets:
- deepvk/ru-HNP
- deepvk/ru-WANLI
- deepvk/cultura_ru_ed
- Shitao/bge-m3-data
- CarlBrendt/Summ_Dialog_News
- IlyaGusev/gazeta
- its5Q/habr_qna
- wikimedia/wikipedia
- RussianNLP/wikiomnia
language:
- ru
---

# USER2-base

**USER2** is a new generation of the **U**niversal **S**entence **E**ncoder for **R**ussian, designed for sentence representation with long-context support of up to 8,192 tokens.

The models are built on top of the [`RuModernBERT`](https://huggingface.co/collections/deepvk/rumodernbert-67b5e82fbc707d7ed3857743) encoders and are fine-tuned for retrieval and semantic tasks.  
They also support [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) — a technique that enables reducing embedding size with minimal loss in representation quality.

This is the base version of the model, with 149 million parameters.

| Model                                                                  | Size | Context Length | Hidden Dim | MRL Dims |
|-----------------------------------------------------------------------:|:----:|:--------------:|:----------:|:-----------------------:|
| [`deepvk/USER2-small`](https://huggingface.co/deepvk/USER2-small)      | 34M  | 8192           | 384        | [32, 64, 128, 256, 384] |
| `deepvk/USER2-base`                                                    | 149M | 8192           | 768        | [32, 64, 128, 256, 384, 512, 768] |

## Performance

To evaluate the model, we measure quality on the `MTEB-rus` benchmark.
Additionally, to measure long-context retrieval, we run the Russian subset of the MultiLongDocRetrieval (MLDR) task.

**MTEB-rus**

| Model                                                                                          | Size  | Hidden Dim | Context Length | MRL support | Mean(task) | Mean(taskType) | Classification | Clustering | MultiLabelClassification | PairClassification | Reranking | Retrieval | STS   |
|----------------------------------------------------------------------------------------------:|:-----:|:----------:|:--------------:|:-----------:|:----------:|:--------------:|:-------------:|:----------:|:------------------------:|:-----------------:|:---------:|:---------:|:-----:|
| `USER-base`                      | 124M | 768   | 512  | ❌ | 58.11 | 56.67 | 59.89 | 53.26 | 37.72 | 59.76 | 55.58 | 56.14 | 74.35 |
| `USER-bge-m3`                    | 359M | 1024  | 8192 | ❌ | 62.80 | 62.28 | 61.92 | 53.66 | 36.18 | 65.07 | 68.72 | 73.63 | 76.76 |
| `multilingual-e5-base`           | 278M | 768   | 512  | ❌ | 58.34 | 57.24 | 58.25 | 50.27 | 33.65 | 54.98 | 66.24 | 67.14 | 70.16 |
| `multilingual-e5-large-instruct` | 560M | 1024  | 512  | ❌ | 65.00 | 63.36 | 66.28 | 63.13 | 41.15 | 63.89 | 64.35 | 68.23 | 76.48 | 
| `jina-embeddings-v3`             | 572M | 1024  | 8192 | ✅ | 63.45 | 60.93 | 65.24 | 60.90 | 39.24 | 59.22 | 53.86 | 71.99 | 76.04 |
| `ru-en-RoSBERTa`                 | 404M | 1024  | 512  | ❌ | 61.71 | 60.40 | 62.56 | 56.06 | 38.88 | 60.79 | 63.89 | 66.52 | 74.13 |
| `USER2-small`                    | 34M  | 384   | 8192 | ✅ | 58.32 | 56.68 | 59.76 | 57.06 | 33.56 | 54.02 | 58.26 | 61.87 | 72.25 |
| `USER2-base`                     | 149M | 768   | 8192 | ✅ | 61.12 | 59.59 | 61.67 | 59.22 | 36.61 | 56.39 | 62.06 | 66.90 | 74.28 |

**MLDR-rus**

| Model                | Size      | nDCG@10 ↑ |
|---------------------:|:---------:|:---------:|
| `USER-bge-m3`        |  359M     | 58.53     | 
| `KaLM-v1.5`          |  494M     | 53.75     | 
| `jina-embeddings-v3` |  572M     | 49.67     |
| `E5-mistral-7b`      |  7.11B    | 52.40     |
| `USER2-small`        |  34M      | 51.69     |
| `USER2-base`         |  149M     | 54.17     |

We compare only models with a context length of 8,192.

## Matryoshka

To evaluate MRL capabilities, we also use `MTEB-rus`, applying dimensionality cropping to the embeddings to match the selected size.

<img src="assets/mrl.png" alt="MRL" width="600"/>

## Usage

### Prefixes

This model is trained similarly to [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes) and expects task-specific prefixes to be added to the input. The choice of prefix depends on the specific task. We follow a few general guidelines when selecting a prefix:
- "classification: " is the default and most universal prefix, often performing well across a variety of tasks.
- "clustering: " is recommended for clustering applications: group texts into clusters, discover shared topics, or remove semantic duplicates.
- "search_query: " and "search_document: " are intended for retrieval and reranking tasks. Also, in some classification tasks, especially with shorter texts, "search_query" shows superior performance to other prefixes. On the other hand, "search_document" can be beneficial for long-context sentence similarity tasks.

However, we encourage users to experiment with different prefixes, as certain domains may benefit from specific ones.
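
For illustration, a prefix is simply prepended to the input string (the example texts below are made up; the retrieval prefixes are also exposed via `prompt_name` in the Sentence Transformers example below):
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")

# The prefix is part of the input text itself
texts = [
    "classification: Отличный сервис, быстрая доставка!",
    "clustering: Запуск новой ракеты-носителя прошёл успешно",
]
embeddings = model.encode(texts)
```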

### Sentence Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")

query_embeddings = model.encode(["Когда был спущен на воду первый миноносец «Спокойный»?"], prompt_name="search_query")
document_embeddings = model.encode(["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."], prompt_name="search_document")

similarities = model.similarity(query_embeddings, document_embeddings)
```

To truncate the embedding dimension, simply pass the new value to the model initialization:
```python
model = SentenceTransformer("deepvk/USER2-base", truncate_dim=128)
```
This model was trained with dimensions `[32, 64, 128, 256, 384, 512, 768]`, so it’s recommended to use one of these for best performance.
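
As a quick, illustrative check, the truncated model returns 128-dimensional embeddings that can be compared as usual (the example texts are made up):
```python
model = SentenceTransformer("deepvk/USER2-base", truncate_dim=128)

embeddings = model.encode([
    "search_query: пример запроса",
    "search_document: пример документа",
])
print(embeddings.shape)  # (2, 128)

similarities = model.similarity(embeddings[:1], embeddings[1:])
```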

### Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padded positions via the attention mask
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


queries = ["search_query: Когда был спущен на воду первый миноносец «Спокойный»?"]
documents = ["search_document: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER2-base")
model = AutoModel.from_pretrained("deepvk/USER2-base")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)

similarities = query_embeddings @ doc_embeddings.T
```

To truncate the embedding dimension, keep only the first `truncate_dim` values and re-normalize:
```python
truncate_dim = 128  # use one of the trained MRL dimensions

query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = query_embeddings[:, :truncate_dim]
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
```

## Training details

This is the base version with 149 million parameters, based on [`RuModernBERT-base`](https://huggingface.co/deepvk/RuModernBERT-base).  
It was fine-tuned in three stages: RetroMAE, Weakly Supervised Fine-Tuning, and Supervised Fine-Tuning.

Following the *bge-m3* training strategy, we use RetroMAE as a retrieval-oriented continuous pretraining step.  
Leveraging data from the final stage of RuModernBERT training, RetroMAE enhances retrieval quality—particularly for long-context inputs.

To follow best practices for building a state-of-the-art encoder, we rely on large-scale training with weakly related text pairs.  
However, such datasets are not publicly available for Russian, unlike for English or Chinese.  
To overcome this, we apply two complementary strategies:

- **Cross-lingual transfer**: We train on both English and Russian data, leveraging English resources (`nomic-unsupervised`) alongside our in-house English-Russian parallel corpora.  
- **Unsupervised pair mining**: From the [`deepvk/cultura_ru_edu`](https://huggingface.co/datasets/deepvk/cultura_ru_edu) corpus, we extract 50M pairs using a simple heuristic—selecting non-overlapping text blocks that are not substrings of one another.
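
A rough sketch of this heuristic is shown below (illustrative only; the block size and the pairing of adjacent blocks are assumptions, not the exact production code):
```python
def mine_pairs(document: str, block_size: int = 512):
    """Split a document into non-overlapping blocks and pair adjacent ones,
    keeping only pairs where neither block is a substring of the other."""
    blocks = [document[i : i + block_size] for i in range(0, len(document), block_size)]
    pairs = []
    for left, right in zip(blocks, blocks[1:]):
        if left not in right and right not in left:
            pairs.append((left, right))
    return pairs
```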

This approach has shown promising results, allowing us to train high-performing models with minimal target-language pairs—especially when compared to pipelines used for other languages.

The table below shows the datasets used and the number of times each was upsampled.

| Dataset                     | Size | Upsample |
|----------------------------:|:----:|:-------:|
| [nomic-en](https://github.com/nomic-ai/nomic)                    | 235M |   1      |
| [nomic-ru](https://github.com/nomic-ai/nomic)                    | 39M  |   3      |
| in-house En-Ru parallel              | 250M |   1      |
| [cultura-sampled](https://huggingface.co/datasets/deepvk/cultura_ru_edu)             | 50M  |   1      |
| **Total**                   | 652M |          |

For the third stage, we switch to cleaner, task-specific datasets.  
In some cases, additional filtering was applied using a cross-encoder.  
For all retrieval datasets, we mine hard negatives.
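
As an illustration of the hard-negative mining step, candidates can be retrieved with an existing encoder and any high-scoring document that is not a known positive kept as a hard negative (a simplified sketch with made-up data, not the exact pipeline):
```python
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("deepvk/USER2-base")

# Toy example: one query with one known positive document
queries = ["search_query: Когда основан Санкт-Петербург?"]
corpus = [
    "search_document: Санкт-Петербург основан 27 мая 1703 года.",  # known positive
    "search_document: Москва впервые упоминается в летописи 1147 года.",
    "search_document: Пётр I провёл масштабные реформы армии и флота.",
]
positive_ids = {0: {0}}  # query index -> indices of its known positives

query_emb = retriever.encode(queries, convert_to_tensor=True)
corpus_emb = retriever.encode(corpus, convert_to_tensor=True)

# Rank the corpus for each query and keep high-scoring non-positives
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)
hard_negatives = {
    q_id: [hit["corpus_id"] for hit in ranked if hit["corpus_id"] not in positive_ids[q_id]]
    for q_id, ranked in enumerate(hits)
}
```
The datasets used at this stage are listed below.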

| Dataset                                                                                                                                          | Examples | Notes                                     |
|-------------------------------------------------------------------------------------------------------------------------------------------------:|:--------:|:------------------------------------------|
| [Nomic-en-supervised](https://huggingface.co/datasets/nomic-ai/nomic-embed-supervised-data)                                                     | 1.7 M    | Unmodified                                |
| AllNLI                                                                                                                                            | 200 K    | Translated SNLI/MNLI/ANLI to Russian      |
| [fishkinet-posts](https://huggingface.co/datasets/nyuuzyou/fishkinet-posts)                                                                       | 93 K     | Title–content pairs                       |
| [gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta)                                                                                        | 55 K     | Title–text pairs                          |
| [habr_qna](https://huggingface.co/datasets/its5Q/habr_qna)                                                                                        | 100 K    | Title–description pairs                   |
| [lenta](https://huggingface.co/datasets/zloelias/lenta-ru)                                                                                        | 100 K    | Title–news pairs                          |
| [miracl_ru](https://huggingface.co/datasets/Shitao/bge-m3-data)                                                                                   | 10 K     | One positive per anchor                   |
| [mldr_ru](https://huggingface.co/datasets/Shitao/bge-m3-data)                                                                                     | 1.8 K    | Unmodified                                |
| [mr-tydi_ru](https://huggingface.co/datasets/Shitao/bge-m3-data)                                                                                  | 5.3 K    | Unmodified                                |
| [mmarco_ru](https://huggingface.co/datasets/unicamp-dl/mmarco)                                                                                    | 500 K    | Unmodified                       |
| [ru-HNP](https://huggingface.co/datasets/deepvk/ru-HNP)                                                                                           | 100 K    | One pos + one neg per anchor              |
| ru‑queries                                                                                                                                         | 199 K    | In-house (generated as in [arXiv:2401.00368](https://arxiv.org/abs/2401.00368)) |
| [ru‑WaNLI](https://huggingface.co/datasets/deepvk/ru-WANLI)                                                                                        | 35 K     | Entailment -> pos, contradiction -> neg         |
| [sampled_wiki](https://huggingface.co/datasets/wikimedia/wikipedia)                                                                                | 1 M      | Sampled text blocks from Wikipedia        |
| [summ_dialog_news](https://huggingface.co/datasets/CarlBrendt/Summ_Dialog_News)                                                                    | 37 K     | Summary–info pairs                        |
| [wikiomnia_qna](https://huggingface.co/datasets/RussianNLP/wikiomnia)                                                                              | 100 K    | QA pairs (T5-generated)                  |
| [yandex_q](https://huggingface.co/datasets/its5Q/yandex-q)                                                                                         | 83 K     | Q+desc-answer pairs                     |
| **Total**                                                                                                                                        | 4.3 M    |                                           |


### Ablation

Alongside the final model, we also release all intermediate training steps.  
Both the **retromae** and **weakly_sft** models are available under the specified revisions in this repository.  
We hope these additional models prove useful for your experiments.
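
For example, an intermediate checkpoint can be loaded by passing its revision name (assuming the revision names match the stage names above):
```python
from sentence_transformers import SentenceTransformer

# Revision names are assumed to match the stage names; check the repository's branches/tags
retromae_model = SentenceTransformer("deepvk/USER2-base", revision="retromae")
weakly_sft_model = SentenceTransformer("deepvk/USER2-base", revision="weakly_sft")
```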

Below is a comparison of all training stages on a subset of `MTEB-rus`.

<img src="assets/training_stages.png" alt="training_stages" width="600"/>

## Citations

```
@misc{deepvk2025user,
    title={USER2},
    author={Malashenko, Boris and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/USER2-base},
    publisher={Hugging Face},
    year={2025}
}
```