---

tags:
- mteb
- sentence-transformers
- transformers
- sentence-similarity
language:
- en
- zh
license: apache-2.0
---


# Conan-Embedding-v2

## What's New?

- **Performance**
  
  Conan-Embedding-v2 now achieves state-of-the-art (SOTA) performance on the MTEB leaderboards for both Chinese and English.
  
- **Cross-lingual Retrieval between Chinese and English**
  
  Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English samples; see the retrieval sketch after this list.
  
- **Longer Context Support**
  
  Conan-Embedding-v2 now supports a context length of 32,768 tokens.
  
- **Conan 1.4B Large Model Trained from Scratch**
  
  A new vocabulary and a large language model trained from scratch, giving a pre-trained backbone and vocabulary better tailored to the embedding scenario and delivering stronger performance.
  
  The Conan-1.4B base model will be open-sourced, so the community can train their own embedding models on top of it.
  
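As a quick illustration of the cross-lingual use case, the sketch below encodes an English query against Chinese passages and ranks them by cosine similarity. This is a minimal sketch, not confirmed usage instructions: the repo id `TencentBAC/Conan-embedding-v2` and loading via `sentence-transformers` with `trust_remote_code=True` are assumptions.

```python
# Minimal cross-lingual retrieval sketch. The repo id below is an
# assumption about where the released weights live.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("TencentBAC/Conan-embedding-v2", trust_remote_code=True)

query = "What is the capital of France?"
passages = [
    "法国的首都是巴黎。",           # "The capital of France is Paris."
    "长城是中国古代的防御工程。",    # "The Great Wall is an ancient Chinese fortification."
]

# Encode both languages into the same embedding space and L2-normalize,
# so the dot product below equals cosine similarity.
query_emb = model.encode([query], normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

scores = (query_emb @ passage_embs.T)[0]
for passage, score in sorted(zip(passages, scores), key=lambda p: -p[1]):
    print(f"{score:.4f}  {passage}")
```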

## Performance

Performance of Conan-Embedding-v2 on MTEB for Chinese and English. In the tables below, the number in parentheses after each column heading is the number of tasks in that category.

![MTEB Result](./src/mteb_res_v2.png)

**English**

| Model                   | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retri. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
|:-----------------------:|:----------------:|:------------------:|:----------------:|:--------------:|:--------------------:|:---------------:|:--------------:|:---------:|
| bge-multilingual-gemma2 | 88.08            | 54.65              | 85.97            | 59.72          | 59.24                | 83.88           | 31.20          | 69.88     |
| e5-mistral-7b-instruct  | 79.89            | 51.44              | 88.42            | 49.78          | 57.62                | 84.32           | **36.57**      | 67.98     |
| gte-Qwen2-7B-instruct   | 86.58            | 56.92              | 85.90            | **61.42**      | 59.11                | 83.06           | 31.35          | 69.95     |
| stella-en-1.5B-v5       | 87.63            | 57.69              | 88.07            | 61.21          | 61.01                | 84.51           | 31.49          | 71.19     |
| bge-en-icl              | 88.95            | 57.89              | 88.14            | 59.86          | 62.16                | 84.24           | 30.77          | 71.67     |
| NV-Embed-v2             | **90.37**        | 58.46              | 88.67            | 60.65          | 62.65                | 84.31           | 30.70          | 72.31     |
| **Conan-embedding-v2**  | 90.15            | **60.86**          | **93.47**        | 60.89          | **66.40**            | **85.73**       | 28.08          | **74.22** |

**Chinese**

| Model                   | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retri. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
|:-----------------------:|:--------------:|:----------------:|:---------------:|:-------------:|:------------------:|:-------------:|:---------:|
| e5-mistral-7b-instruct  | 72.96          | 52.30            | 72.19           | 61.86         | 61.75              | 48.34         | 59.92     |
| gte-Qwen2-1.5B-instruct | 72.53          | 54.61            | 86.91           | 68.21         | 71.86              | 60.05         | 67.12     |
| bge-multilingual-gemma2 | 75.31          | 59.30            | 86.67           | 68.28         | 73.73              | 55.19         | 67.64     |
| gte-Qwen2-7B-instruct   | 75.77          | 66.06            | 87.48           | 68.92         | 75.71              | 65.20         | 71.62     |
| xiaobu-embedding-v2     | 76.53          | 65.17            | 91.87           | 72.58         | 76.50              | 64.18         | 72.36     |
| Conan-embedding-v1      | **76.77**      | 66.33            | 91.66           | 72.76         | 76.67              | 63.67         | 72.50     |
| **Conan-embedding-v2**  | 76.47          | **68.84**        | **92.44**       | **74.41**     | **78.31**          | **65.48**     | **74.24** |


## Model Details

### Model Structure

**Conan-Embedding-v2 Structure:**

```
SentenceTransformer(
    (0): Transformer({
        'max_seq_length': 32768,
        'do_lower_case': False
        }) with Transformer model: ConanEmbedModel,
    (1): Pooling({
        'word_embedding_dimension': 3584,
        'pooling_mode_cls_token': False,
        'pooling_mode_mean_tokens': True,
        'pooling_mode_max_tokens': False,
        'pooling_mode_mean_sqrt_len_tokens': False,
        'pooling_mode_weightedmean_tokens': False,
        'pooling_mode_lasttoken': False,
        'include_prompt': True
        }),
    (2): Dense({
        'in_features': 3584,
        'out_features': 3584,
        'bias': True,
        'activation_function': 'torch.nn.modules.linear.Identity'
        })
)
```

**Key Specifications of Conan-1.4B (Transformer):**

- Number of Parameters (excluding the Dense layer): 1.48B
  
- Vocabulary Size: 150,000
  
- Number of Layers: 8
  
- Hidden Layer Dimension: 3584
  
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
  
- Intermediate Dimension of FFN Layer: 8192
  
- Maximum Context Window: 32,768 Tokens

For more model details, please refer to `model/modeling_conan.py` and `config.json`, or stay tuned for the upcoming open-source release of the Conan-1.4B base model.
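
To make the pipeline above concrete, here is a hand-rolled sketch of the equivalent flow using `transformers` directly: mean-pool the token embeddings (the only pooling mode enabled above), which the identity-activation Dense layer then maps 3584 → 3584. The repo id is an assumption, as is the premise that the `ConanEmbedModel` backbone exposes `last_hidden_state` in the usual `transformers` fashion; the bundled `SentenceTransformer` wrapper is the supported path.

```python
# Hand-rolled equivalent of the SentenceTransformer pipeline above.
# Assumptions: the repo id, trust_remote_code loading, and a standard
# last_hidden_state output. Illustrative only; the Dense layer's
# trained weights are not applied here.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "TencentBAC/Conan-embedding-v2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)

batch = tokenizer(["Hello!"], padding=True, truncation=True,
                  max_length=32768, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq, 3584)

# Mean pooling over non-padding tokens (pooling_mode_mean_tokens=True above).
mask = batch["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# The Dense module is a 3584 -> 3584 linear layer with an Identity
# activation, so it preserves the embedding dimension.
print(sentence_embeddings.shape)  # torch.Size([1, 3584])
```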

### Tokenizer

We trained the tokenizer on a large-scale multilingual dataset to build a standard BBPE (byte-level Byte Pair Encoding) tokenizer with a vocabulary size of 150,000.
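
A quick way to inspect the tokenizer (the repo id below is an assumption, as above): byte-level BPE can encode any UTF-8 string without out-of-vocabulary tokens, so English and Chinese text round-trip through the same 150,000-entry vocabulary.

```python
# Inspect the BBPE tokenizer (repo id is an assumption, as above).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TencentBAC/Conan-embedding-v2", trust_remote_code=True
)

print(len(tokenizer))  # expected vocabulary size: 150,000

# Byte-level BPE encodes arbitrary UTF-8 text with no OOV tokens,
# so English and Chinese share one vocabulary.
for text in ["Hello, world!", "你好，世界！"]:
    ids = tokenizer.encode(text)
    print(text, "->", ids, "->", tokenizer.decode(ids))
```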

## Technical Report

We will soon release our technical report.

## Using Conan-Embedding-v2

Use `model/conan_api_client.py` to access our test API. A sample call is as follows:

```python
import os

# ConanClient is provided in model/conan_api_client.py (see above).
from conan_api_client import ConanClient

# Read the access key and secret key from the environment.
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)
```

This is a temporary access method; please contact us to obtain an access token.

In the future, we will provide high-performance, cost-effective, and reliable embedding services on Tencent Cloud.


---

**About**

Created by the Tencent BAC Group. All rights reserved.