---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: apache-2.0
base_model:
- deepvk/RuModernBERT-base
datasets:
- deepvk/ru-HNP
- deepvk/ru-WANLI
- deepvk/cultura_ru_edu
- Shitao/bge-m3-data
- CarlBrendt/Summ_Dialog_News
- IlyaGusev/gazeta
- its5Q/habr_qna
- wikimedia/wikipedia
- RussianNLP/wikiomnia
language:
- ru
---
# USER2-base
**USER2** is a new generation of the **U**niversal **S**entence **E**ncoder for **R**ussian, designed for sentence representation with long-context support of up to 8,192 tokens.
The models are built on top of the [`RuModernBERT`](https://huggingface.co/collections/deepvk/rumodernbert-67b5e82fbc707d7ed3857743) encoders and are fine-tuned for retrieval and semantic tasks.
They also support [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) — a technique that enables reducing embedding size with minimal loss in representation quality.
This is a base model with 149 million parameters.
| Model | Size | Context Length | Hidden Dim | MRL Dims |
|-----------------------------------------------------------------------:|:----:|:--------------:|:----------:|:-----------------------:|
| [`deepvk/USER2-small`](https://huggingface.co/deepvk/USER2-small) | 34M | 8192 | 384 | [32, 64, 128, 256, 384] |
| `deepvk/USER2-base` | 149M | 8192 | 768 | [32, 64, 128, 256, 384, 512, 768] |
## Performance
To evaluate the model, we measure quality on the `MTEB-rus` benchmark.
Additionally, to measure long-context retrieval, we evaluate on the Russian subset of the MultiLongDocRetrieval (MLDR) task.
**MTEB-rus**
| Model | Size | Hidden Dim | Context Length | MRL support | Mean(task) | Mean(taskType) | Classification | Clustering | MultiLabelClassification | PairClassification | Reranking | Retrieval | STS |
|----------------------------------------------------------------------------------------------:|:-----:|:----------:|:--------------:|:-----------:|:----------:|:--------------:|:-------------:|:----------:|:------------------------:|:-----------------:|:---------:|:---------:|:-----:|
| `USER-base` | 124M | 768 | 512 | ❌ | 58.11 | 56.67 | 59.89 | 53.26 | 37.72 | 59.76 | 55.58 | 56.14 | 74.35 |
| `USER-bge-m3` | 359M | 1024 | 8192 | ❌ | 62.80 | 62.28 | 61.92 | 53.66 | 36.18 | 65.07 | 68.72 | 73.63 | 76.76 |
| `multilingual-e5-base` | 278M | 768 | 512 | ❌ | 58.34 | 57.24 | 58.25 | 50.27 | 33.65 | 54.98 | 66.24 | 67.14 | 70.16 |
| `multilingual-e5-large-instruct` | 560M | 1024 | 512 | ❌ | 65.00 | 63.36 | 66.28 | 63.13 | 41.15 | 63.89 | 64.35 | 68.23 | 76.48 |
| `jina-embeddings-v3` | 572M | 1024 | 8192 | ✅ | 63.45 | 60.93 | 65.24 | 60.90 | 39.24 | 59.22 | 53.86 | 71.99 | 76.04 |
| `ru-en-RoSBERTa` | 404M | 1024 | 512 | ❌ | 61.71 | 60.40 | 62.56 | 56.06 | 38.88 | 60.79 | 63.89 | 66.52 | 74.13 |
| `USER2-small` | 34M | 384 | 8192 | ✅ | 58.32 | 56.68 | 59.76 | 57.06 | 33.56 | 54.02 | 58.26 | 61.87 | 72.25 |
| `USER2-base` | 149M | 768 | 8192 | ✅ | 61.12 | 59.59 | 61.67 | 59.22 | 36.61 | 56.39 | 62.06 | 66.90 | 74.28 |
**MLDR-rus**
| Model | Size | nDCG@10 ↑ |
|---------------------:|:---------:|:---------:|
| `USER-bge-m3` | 359M | 58.53 |
| `KaLM-v1.5` | 494M | 53.75 |
| `jina-embeddings-v3` | 572M | 49.67 |
| `E5-mistral-7b` | 7.11B | 52.40 |
| `USER2-small` | 34M | 51.69 |
| `USER2-base` | 149M | 54.17 |
We compare only models with a context length of 8192 tokens.
## Matryoshka
To evaluate MRL capabilities, we also use `MTEB-rus`, applying dimensionality cropping to the embeddings to match the selected size.
<img src="assets/mrl.png" alt="MRL" width="600"/>
## Usage
### Prefixes
This model is trained similarly to [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes) and expects task-specific prefixes to be added to the input. The choice of prefix depends on the specific task. We follow a few general guidelines when selecting a prefix:
- "classification: " is the default and most universal prefix, often performing well across a variety of tasks.
- "clustering: " is recommended for clustering applications: group texts into clusters, discover shared topics, or remove semantic duplicates.
- "search_query: " and "search_document: " are intended for retrieval and reranking tasks. Also, in some classification tasks, especially with shorter texts, "search_query" shows superior performance to other prefixes. On the other hand, "search_document" can be beneficial for long-context sentence similarity tasks.
However, we encourage users to experiment with different prefixes, as certain domains may benefit from specific ones.
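As a rough illustration, the prefixes are plain strings prepended to the input text (the sentences below are made-up examples; for the retrieval prefixes, the built-in `prompt_name` argument shown in the next section does the same thing):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")

texts = ["Отличный фильм, всем советую!", "Скучно, не досмотрел."]

# prepend the task prefix directly to each input string
cls_embeddings = model.encode([f"classification: {t}" for t in texts])
clu_embeddings = model.encode([f"clustering: {t}" for t in texts])
```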
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("deepvk/USER2-base")
query_embeddings = model.encode(["Когда был спущен на воду первый миноносец «Спокойный»?"], prompt_name="search_query")
document_embeddings = model.encode(["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."], prompt_name="search_document")
similarities = model.similarity(query_embeddings, document_embeddings)
```
To truncate the embedding dimension, simply pass the new value to the model initialization:
```python
model = SentenceTransformer("deepvk/USER2-base", truncate_dim=128)
```
This model was trained with dimensions `[32, 64, 128, 256, 384, 512, 768]`, so it’s recommended to use one of these for best performance.
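For example, a quick sanity check (reusing the query and document from above) confirms that the embeddings come out at the truncated size and can still be compared directly:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base", truncate_dim=128)

query_embeddings = model.encode(["Когда был спущен на воду первый миноносец «Спокойный»?"], prompt_name="search_query")
document_embeddings = model.encode(["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."], prompt_name="search_document")

print(query_embeddings.shape)  # (1, 128)
similarities = model.similarity(query_embeddings, document_embeddings)
```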
### Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
def mean_pooling(model_output, attention_mask):
    # average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# task prefixes are prepended directly to the raw text
queries = ["search_query: Когда был спущен на воду первый миноносец «Спокойный»?"]
documents = ["search_document: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER2-base")
model = AutoModel.from_pretrained("deepvk/USER2-base")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

# mean-pool and L2-normalize so that the dot product equals cosine similarity
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)

similarities = query_embeddings @ doc_embeddings.T
```
To truncate the embedding dimension, keep only the first `truncate_dim` values and then re-normalize:
```python
truncate_dim = 128  # use one of the trained MRL dimensions
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = query_embeddings[:, :truncate_dim]  # crop first, then normalize
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
```
## Training details
This is the base version with 149 million parameters, based on [`RuModernBERT-base`](https://huggingface.co/deepvk/RuModernBERT-base).
It was fine-tuned in three stages: RetroMAE, Weakly Supervised Fine-Tuning, and Supervised Fine-Tuning.
Following the *bge-m3* training strategy, we use RetroMAE as a retrieval-oriented continued pretraining step.
Leveraging data from the final stage of RuModernBERT training, RetroMAE enhances retrieval quality—particularly for long-context inputs.
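For intuition, here is a deliberately simplified, illustrative sketch of a RetroMAE-style objective. This is not the training code used for USER2: the masking ratios, the single-layer decoder, and the omission of RetroMAE's enhanced decoding are our simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-base")
encoder = AutoModel.from_pretrained("deepvk/RuModernBERT-base")
hidden = encoder.config.hidden_size

# shallow decoder: reconstruct a heavily masked copy of the input from the [CLS] bottleneck
decoder = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
lm_head = nn.Linear(hidden, encoder.config.vocab_size)

def mask_tokens(input_ids, ratio, mask_id):
    """Replace a random `ratio` of tokens with [MASK]; labels are -100 on unmasked positions."""
    mask = torch.rand(input_ids.shape) < ratio
    return input_ids.masked_fill(mask, mask_id), input_ids.masked_fill(~mask, -100)

batch = tokenizer(["Пример обучающего текста."], return_tensors="pt")
mask_id = tokenizer.mask_token_id

# encoder side: light masking, keep only the [CLS] embedding as the sentence bottleneck
enc_ids, _ = mask_tokens(batch["input_ids"], ratio=0.3, mask_id=mask_id)
cls_vec = encoder(input_ids=enc_ids, attention_mask=batch["attention_mask"]).last_hidden_state[:, :1]

# decoder side: aggressive masking; the bottleneck vector replaces the [CLS] slot
dec_ids, labels = mask_tokens(batch["input_ids"], ratio=0.5, mask_id=mask_id)
labels[:, 0] = -100  # never predict the [CLS] position
dec_in = torch.cat([cls_vec, encoder.get_input_embeddings()(dec_ids)[:, 1:]], dim=1)
logits = lm_head(decoder(dec_in))
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```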
To follow best practices for building a state-of-the-art encoder, we rely on large-scale training with weakly related text pairs.
However, such datasets are not publicly available for Russian, unlike for English or Chinese.
To overcome this, we apply two complementary strategies:
- **Cross-lingual transfer**: We train on both English and Russian data, leveraging English resources (`nomic-unsupervised`) alongside our in-house English-Russian parallel corpora.
- **Unsupervised pair mining**: From the [`deepvk/cultura_ru_edu`](https://huggingface.co/datasets/deepvk/cultura_ru_edu) corpus, we extract 50M pairs using a simple heuristic—selecting non-overlapping text blocks that are not substrings of one another.
This approach has shown promising results, allowing us to train high-performing models with minimal target-language pairs—especially when compared to pipelines used for other languages.
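A minimal sketch of the pair-mining heuristic (our illustration only; the block size and exact filtering are assumptions, not the production pipeline):

```python
def mine_pairs(document: str, block_size: int = 512) -> list[tuple[str, str]]:
    """Split a document into consecutive, non-overlapping character blocks and
    pair up neighbors, skipping pairs where one block is a substring of the other."""
    blocks = [document[i : i + block_size] for i in range(0, len(document), block_size)]
    pairs = []
    for a, b in zip(blocks, blocks[1:]):
        if a in b or b in a:
            continue  # drop near-duplicate blocks
        pairs.append((a, b))
    return pairs
```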
The table below shows the datasets used and the number of times each was upsampled.
| Dataset | Size | Upsample |
|----------------------------:|:----:|:-------:|
| [nomic-en](https://github.com/nomic-ai/nomic) | 235M | 1 |
| [nomic-ru](https://github.com/nomic-ai/nomic) | 39M | 3 |
| in-house En-Ru parallel | 250M | 1 |
| [cultura-sampled](https://huggingface.co/datasets/deepvk/cultura_ru_edu) | 50M | 1 |
| **Total** | 652M | |
For the third stage, we switch to cleaner, task-specific datasets.
In some cases, we apply additional filtering with a cross-encoder.
For all retrieval datasets, we mine hard negatives.
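As an illustration of hard-negative mining (a sketch under assumed settings: a stand-in retriever and a top-10 cutoff, not the exact recipe used for training):

```python
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer("deepvk/USER2-base")  # stand-in retriever for illustration

def mine_hard_negatives(query: str, positive: str, corpus: list[str], top_k: int = 10) -> list[str]:
    """Return passages the retriever ranks highly for the query but that are not the gold positive."""
    q_emb = retriever.encode([f"search_query: {query}"])
    c_emb = retriever.encode([f"search_document: {d}" for d in corpus])
    scores = retriever.similarity(q_emb, c_emb)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [corpus[int(i)] for i in ranked if corpus[int(i)] != positive]
```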
| Dataset | Examples | Notes |
|-------------------------------------------------------------------------------------------------------------------------------------------------:|:--------:|:------------------------------------------|
| [Nomic-en-supervised](https://huggingface.co/datasets/nomic-ai/nomic-embed-supervised-data) | 1.7 M | Unmodified |
| AllNLI | 200 K | SNLI/MNLI/ANLI translated into Russian |
| [fishkinet-posts](https://huggingface.co/datasets/nyuuzyou/fishkinet-posts) | 93 K | Title–content pairs |
| [gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta) | 55 K | Title–text pairs |
| [habr_qna](https://huggingface.co/datasets/its5Q/habr_qna) | 100 K | Title–description pairs |
| [lenta](https://huggingface.co/datasets/zloelias/lenta-ru) | 100 K | Title–news pairs |
| [miracl_ru](https://huggingface.co/datasets/Shitao/bge-m3-data) | 10 K | One positive per anchor |
| [mldr_ru](https://huggingface.co/datasets/Shitao/bge-m3-data) | 1.8 K | Unmodified |
| [mr-tydi_ru](https://huggingface.co/datasets/Shitao/bge-m3-data) | 5.3 K | Unmodified |
| [mmarco_ru](https://huggingface.co/datasets/unicamp-dl/mmarco) | 500 K | Unmodified |
| [ru-HNP](https://huggingface.co/datasets/deepvk/ru-HNP) | 100 K | One pos + one neg per anchor |
| ru‑queries | 199 K | In-house (generated as in [arXiv:2401.00368](https://arxiv.org/abs/2401.00368)) |
| [ru‑WaNLI](https://huggingface.co/datasets/deepvk/ru-WANLI) | 35 K | Entailment -> pos, contradiction -> neg |
| [sampled_wiki](https://huggingface.co/datasets/wikimedia/wikipedia) | 1 M | Sampled text blocks from Wikipedia |
| [summ_dialog_news](https://huggingface.co/datasets/CarlBrendt/Summ_Dialog_News) | 37 K | Summary–info pairs |
| [wikiomnia_qna](https://huggingface.co/datasets/RussianNLP/wikiomnia) | 100 K | QA pairs (T5-generated) |
| [yandex_q](https://huggingface.co/datasets/its5Q/yandex-q) | 83 K | Q+desc-answer pairs |
| **Total** | 4.3 M | |
### Ablation
Alongside the final model, we also release all intermediate training steps.
Both the **retromae** and **weakly_sft** checkpoints are available as separate revisions of this repository.
We hope these additional models prove useful for your experiments.
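Assuming the stage names above also serve as the revision names (check the repository's branches and tags if not), the intermediate checkpoints can be loaded like this:

```python
from sentence_transformers import SentenceTransformer

# revision names assumed to match the stage names; verify against the repo's listed revisions
retromae_model = SentenceTransformer("deepvk/USER2-base", revision="retromae")
weakly_sft_model = SentenceTransformer("deepvk/USER2-base", revision="weakly_sft")
```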
Below is a comparison of all training stages on a subset of `MTEB-rus`.
<img src="assets/training_stages.png" alt="training_stages" width="600"/>
## Citations
```
@misc{deepvk2025user,
    title={USER2},
    author={Malashenko, Boris and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/USER2-base},
    publisher={Hugging Face},
    year={2025},
}
```