---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
license: apache-2.0
base_model:
- deepvk/RuModernBERT-base
datasets:
- deepvk/ru-HNP
- deepvk/ru-WANLI
- deepvk/cultura_ru_edu
- Shitao/bge-m3-data
- CarlBrendt/Summ_Dialog_News
- IlyaGusev/gazeta
- its5Q/habr_qna
- wikimedia/wikipedia
- RussianNLP/wikiomnia
language:
- ru
---
# USER2-base
**USER2** is a new generation of the **U**niversal **S**entence **E**ncoder for **R**ussian, designed for sentence representation with long-context support of up to 8,192 tokens.
The models are built on top of the [`RuModernBERT`](https://huggingface.co/collections/deepvk/rumodernbert-67b5e82fbc707d7ed3857743) encoders and are fine-tuned for retrieval and semantic tasks.
They also support [Matryoshka Representation Learning (MRL)](https://arxiv.org/abs/2205.13147) — a technique that enables reducing embedding size with minimal loss in representation quality.
This is a base model with 149 million parameters.
| Model | Size | Context Length | Hidden Dim | MRL Dims |
|-----------------------------------------------------------------------:|:----:|:--------------:|:----------:|:-----------------------:|
| [`deepvk/USER2-small`](https://huggingface.co/deepvk/USER2-small) | 34M | 8192 | 384 | [32, 64, 128, 256, 384] |
| `deepvk/USER2-base` | 149M | 8192 | 768 | [32, 64, 128, 256, 384, 512, 768] |
## Performance
To evaluate the model, we measure quality on the `MTEB-rus` benchmark.
Additionally, to measure long-context retrieval, we evaluate on the Russian subset of the MultiLongDocRetrieval (MLDR) task.
**MTEB-rus**
| Model | Size | Hidden Dim | Context Length | MRL support | Mean(task) | Mean(taskType) | Classification | Clustering | MultiLabelClassification | PairClassification | Reranking | Retrieval | STS |
|----------------------------------------------------------------------------------------------:|:-----:|:----------:|:--------------:|:-----------:|:----------:|:--------------:|:-------------:|:----------:|:------------------------:|:-----------------:|:---------:|:---------:|:-----:|
| `USER-base` | 124M | 768 | 512 | ❌ | 58.11 | 56.67 | 59.89 | 53.26 | 37.72 | 59.76 | 55.58 | 56.14 | 74.35 |
| `USER-bge-m3` | 359M | 1024 | 8192 | ❌ | 62.80 | 62.28 | 61.92 | 53.66 | 36.18 | 65.07 | 68.72 | 73.63 | 76.76 |
| `multilingual-e5-base` | 278M | 768 | 512 | ❌ | 58.34 | 57.24 | 58.25 | 50.27 | 33.65 | 54.98 | 66.24 | 67.14 | 70.16 |
| `multilingual-e5-large-instruct` | 560M | 1024 | 512 | ❌ | 65.00 | 63.36 | 66.28 | 63.13 | 41.15 | 63.89 | 64.35 | 68.23 | 76.48 |
| `jina-embeddings-v3` | 572M | 1024 | 8192 | ✅ | 63.45 | 60.93 | 65.24 | 60.90 | 39.24 | 59.22 | 53.86 | 71.99 | 76.04 |
| `ru-en-RoSBERTa` | 404M | 1024 | 512 | ❌ | 61.71 | 60.40 | 62.56 | 56.06 | 38.88 | 60.79 | 63.89 | 66.52 | 74.13 |
| `USER2-small` | 34M | 384 | 8192 | ✅ | 58.32 | 56.68 | 59.76 | 57.06 | 33.56 | 54.02 | 58.26 | 61.87 | 72.25 |
| `USER2-base` | 149M | 768 | 8192 | ✅ | 61.12 | 59.59 | 61.67 | 59.22 | 36.61 | 56.39 | 62.06 | 66.90 | 74.28 |
**MLDR-rus**
| Model | Size | nDCG@10 ↑ |
|---------------------:|:---------:|:---------:|
| `USER-bge-m3` | 359M | 58.53 |
| `KaLM-v1.5` | 494M | 53.75 |
| `jina-embeddings-v3` | 572M | 49.67 |
| `E5-mistral-7b` | 7.11B | 52.40 |
| `USER2-small` | 34M | 51.69 |
| `USER2-base` | 149M | 54.17 |
We compare only models with a context length of 8192 tokens.
## Matryoshka
To evaluate MRL capabilities, we also use `MTEB-rus`, applying dimensionality cropping to the embeddings to match the selected size.
<img src="assets/mrl.png" alt="MRL" width="600"/>
## Usage
### Prefixes
This model is trained similarly to [Nomic Embed](https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes) and expects task-specific prefixes to be added to the input. The choice of prefix depends on the specific task. We follow a few general guidelines when selecting a prefix:
- "classification: " is the default and most universal prefix, often performing well across a variety of tasks.
- "clustering: " is recommended for clustering applications: group texts into clusters, discover shared topics, or remove semantic duplicates.
- "search_query: " and "search_document: " are intended for retrieval and reranking tasks. Also, in some classification tasks, especially with shorter texts, "search_query" shows superior performance to other prefixes. On the other hand, "search_document" can be beneficial for long-context sentence similarity tasks.
However, we encourage users to experiment with different prefixes, as certain domains may benefit from specific ones.
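As a rough illustration, the prefixes are plain strings prepended to the input text (the sentences below are made-up examples; for the retrieval prefixes, the built-in `prompt_name` argument shown in the next section does the same thing):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base")

texts = ["Отличный фильм, всем советую!", "Скучно, не досмотрел."]

# prepend the task prefix directly to each input string
cls_embeddings = model.encode([f"classification: {t}" for t in texts])
clu_embeddings = model.encode([f"clustering: {t}" for t in texts])
```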
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("deepvk/USER2-base")
query_embeddings = model.encode(["Когда был спущен на воду первый миноносец «Спокойный»?"], prompt_name="search_query")
document_embeddings = model.encode(["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."], prompt_name="search_document")
similarities = model.similarity(query_embeddings, document_embeddings)
```
To truncate the embedding dimension, simply pass the new value to the model initialization:
```python
model = SentenceTransformer("deepvk/USER2-base", truncate_dim=128)
```
This model was trained with dimensions `[32, 64, 128, 256, 384, 512, 768]`, so it’s recommended to use one of these for best performance.
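For example, a quick sanity check (reusing the query and document from above) confirms that the embeddings come out at the truncated size and can still be compared directly:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("deepvk/USER2-base", truncate_dim=128)

query_embeddings = model.encode(["Когда был спущен на воду первый миноносец «Спокойный»?"], prompt_name="search_query")
document_embeddings = model.encode(["Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."], prompt_name="search_document")

print(query_embeddings.shape)  # (1, 128)
similarities = model.similarity(query_embeddings, document_embeddings)
```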
### Transformers
```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
def mean_pooling(model_output, attention_mask):
    # average token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# task prefixes are prepended directly to the raw text
queries = ["search_query: Когда был спущен на воду первый миноносец «Спокойный»?"]
documents = ["search_document: Спокойный (эсминец)\nЗачислен в списки ВМФ СССР 19 августа 1952 года."]

tokenizer = AutoTokenizer.from_pretrained("deepvk/USER2-base")
model = AutoModel.from_pretrained("deepvk/USER2-base")

encoded_queries = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
encoded_documents = tokenizer(documents, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    queries_outputs = model(**encoded_queries)
    documents_outputs = model(**encoded_documents)

# mean-pool and L2-normalize so that the dot product equals cosine similarity
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = mean_pooling(documents_outputs, encoded_documents["attention_mask"])
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)

similarities = query_embeddings @ doc_embeddings.T
```
To truncate the embedding dimension, keep only the first `truncate_dim` values and then re-normalize:
```python
truncate_dim = 128  # use one of the trained MRL dimensions
query_embeddings = mean_pooling(queries_outputs, encoded_queries["attention_mask"])
query_embeddings = query_embeddings[:, :truncate_dim]  # crop first, then normalize
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
```
## Training details
This is the base version with 149 million parameters, based on [`RuModernBERT-base`](https://huggingface.co/deepvk/RuModernBERT-base).
It was fine-tuned in three stages: RetroMAE, Weakly Supervised Fine-Tuning, and Supervised Fine-Tuning.
Following the *bge-m3* training strategy, we use RetroMAE as a retrieval-oriented continued pretraining step.
Leveraging data from the final stage of RuModernBERT training, RetroMAE enhances retrieval quality—particularly for long-context inputs.
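For intuition, here is a deliberately simplified, illustrative sketch of a RetroMAE-style objective. This is not the training code used for USER2: the masking ratios, the single-layer decoder, and the omission of RetroMAE's enhanced decoding are our simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepvk/RuModernBERT-base")
encoder = AutoModel.from_pretrained("deepvk/RuModernBERT-base")
hidden = encoder.config.hidden_size

# shallow decoder: reconstruct a heavily masked copy of the input from the [CLS] bottleneck
decoder = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
lm_head = nn.Linear(hidden, encoder.config.vocab_size)

def mask_tokens(input_ids, ratio, mask_id):
    """Replace a random `ratio` of tokens with [MASK]; labels are -100 on unmasked positions."""
    mask = torch.rand(input_ids.shape) < ratio
    return input_ids.masked_fill(mask, mask_id), input_ids.masked_fill(~mask, -100)

batch = tokenizer(["Пример обучающего текста."], return_tensors="pt")
mask_id = tokenizer.mask_token_id

# encoder side: light masking, keep only the [CLS] embedding as the sentence bottleneck
enc_ids, _ = mask_tokens(batch["input_ids"], ratio=0.3, mask_id=mask_id)
cls_vec = encoder(input_ids=enc_ids, attention_mask=batch["attention_mask"]).last_hidden_state[:, :1]

# decoder side: aggressive masking; the bottleneck vector replaces the [CLS] slot
dec_ids, labels = mask_tokens(batch["input_ids"], ratio=0.5, mask_id=mask_id)
labels[:, 0] = -100  # never predict the [CLS] position
dec_in = torch.cat([cls_vec, encoder.get_input_embeddings()(dec_ids)[:, 1:]], dim=1)
logits = lm_head(decoder(dec_in))
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```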
To follow best practices for building a state-of-the-art encoder, we rely on large-scale training with weakly related text pairs.
However, such datasets are not publicly available for Russian, unlike for English or Chinese.
To overcome this, we apply two complementary strategies:
- **Cross-lingual transfer**: We train on both English and Russian data, leveraging English resources (`nomic-unsupervised`) alongside our in-house English-Russian parallel corpora.
- **Unsupervised pair mining**: From the [`deepvk/cultura_ru_edu`](https://huggingface.co/datasets/deepvk/cultura_ru_edu) corpus, we extract 50M pairs using a simple heuristic—selecting non-overlapping text blocks that are not substrings of one another.
This approach has shown promising results, allowing us to train high-performing models with minimal target-language pairs—especially when compared to pipelines used for other languages.
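A minimal sketch of the pair-mining heuristic (our illustration only; the block size and exact filtering are assumptions, not the production pipeline):

```python
def mine_pairs(document: str, block_size: int = 512) -> list[tuple[str, str]]:
    """Split a document into consecutive, non-overlapping character blocks and
    pair up neighbors, skipping pairs where one block is a substring of the other."""
    blocks = [document[i : i + block_size] for i in range(0, len(document), block_size)]
    pairs = []
    for a, b in zip(blocks, blocks[1:]):
        if a in b or b in a:
            continue  # drop near-duplicate blocks
        pairs.append((a, b))
    return pairs
```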
The table below shows the datasets used and the number of times each was upsampled.
| Dataset | Size | Upsample |
|----------------------------:|:----:|:-------:|
| [nomic-en](https://github.com/nomic-ai/nomic) | 235M | 1 |
| [nomic-ru](https://github.com/nomic-ai/nomic) | 39M | 3 |
| in-house En-Ru parallel | 250M | 1 |
| [cultura-sampled](https://huggingface.co/datasets/deepvk/cultura_ru_edu) | 50M | 1 |
| **Total** | 652M | |
For the third stage, we switch to cleaner, task-specific datasets.
In some cases, we apply additional filtering with a cross-encoder.
For all retrieval datasets, we mine hard negatives.
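As an illustration of hard-negative mining (a sketch under assumed settings: a stand-in retriever and a top-10 cutoff, not the exact recipe used for training):

```python
from sentence_transformers import SentenceTransformer

retriever = SentenceTransformer("deepvk/USER2-base")  # stand-in retriever for illustration

def mine_hard_negatives(query: str, positive: str, corpus: list[str], top_k: int = 10) -> list[str]:
    """Return passages the retriever ranks highly for the query but that are not the gold positive."""
    q_emb = retriever.encode([f"search_query: {query}"])
    c_emb = retriever.encode([f"search_document: {d}" for d in corpus])
    scores = retriever.similarity(q_emb, c_emb)[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [corpus[int(i)] for i in ranked if corpus[int(i)] != positive]
```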
| Dataset | Examples | Notes |
|-------------------------------------------------------------------------------------------------------------------------------------------------:|:--------:|:------------------------------------------|
| [Nomic-en-supervised](https://huggingface.co/datasets/nomic-ai/nomic-embed-supervised-data) | 1.7 M | Unmodified |
| AllNLI | 200 K | SNLI/MNLI/ANLI translated into Russian |
| [fishkinet-posts](https://huggingface.co/datasets/nyuuzyou/fishkinet-posts) | 93 K | Title–content pairs |
| [gazeta](https://huggingface.co/datasets/IlyaGusev/gazeta) | 55 K | Title–text pairs |
| [habr_qna](https://huggingface.co/datasets/its5Q/habr_qna) | 100 K | Title–description pairs |
| [lenta](https://huggingface.co/datasets/zloelias/lenta-ru) | 100 K | Title–news pairs |
| [miracl_ru](https://huggingface.co/datasets/Shitao/bge-m3-data) | 10 K | One positive per anchor |
| [mldr_ru](https://huggingface.co/datasets/Shitao/bge-m3-data) | 1.8 K | Unmodified |
| [mr-tydi_ru](https://huggingface.co/datasets/Shitao/bge-m3-data) | 5.3 K | Unmodified |
| [mmarco_ru](https://huggingface.co/datasets/unicamp-dl/mmarco) | 500 K | Unmodified |
| [ru-HNP](https://huggingface.co/datasets/deepvk/ru-HNP) | 100 K | One pos + one neg per anchor |
| ru‑queries | 199 K | In-house (generated as in [arXiv:2401.00368](https://arxiv.org/abs/2401.00368)) |
| [ru‑WaNLI](https://huggingface.co/datasets/deepvk/ru-WANLI) | 35 K | Entailment -> pos, contradiction -> neg |
| [sampled_wiki](https://huggingface.co/datasets/wikimedia/wikipedia) | 1 M | Sampled text blocks from Wikipedia |
| [summ_dialog_news](https://huggingface.co/datasets/CarlBrendt/Summ_Dialog_News) | 37 K | Summary–info pairs |
| [wikiomnia_qna](https://huggingface.co/datasets/RussianNLP/wikiomnia) | 100 K | QA pairs (T5-generated) |
| [yandex_q](https://huggingface.co/datasets/its5Q/yandex-q) | 83 K | Q+desc-answer pairs |
| **Total** | 4.3 M | |
### Ablation
Alongside the final model, we also release all intermediate training steps.
Both the **retromae** and **weakly_sft** checkpoints are available as separate revisions of this repository.
We hope these additional models prove useful for your experiments.
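Assuming the stage names above also serve as the revision names (check the repository's branches and tags if not), the intermediate checkpoints can be loaded like this:

```python
from sentence_transformers import SentenceTransformer

# revision names assumed to match the stage names; verify against the repo's listed revisions
retromae_model = SentenceTransformer("deepvk/USER2-base", revision="retromae")
weakly_sft_model = SentenceTransformer("deepvk/USER2-base", revision="weakly_sft")
```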
Below is a comparison of all training stages on a subset of `MTEB-rus`.
<img src="assets/training_stages.png" alt="training_stages" width="600"/>
## Citations
```
@misc{deepvk2025user,
    title={USER2},
    author={Malashenko, Boris and Spirin, Egor and Sokolov, Andrey},
    url={https://huggingface.co/deepvk/USER2-base},
    publisher={Hugging Face},
    year={2025},
}
```