---
tags:
- mteb
- sentence-transformers
- transformers
- sentence-similarity
language:
- en
- zh
license: apache-2.0
---
# Conan-Embedding-v2
## What's New?
- **Performance**
Conan-Embedding-v2 has now achieved SOTA performance on the MTEB leaderboard for both Chinese and English.
- **Cross-lingual Retrieval between Chinese and English**
Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English samples.
- **Longer Context Support**
Conan-Embedding-v2 now supports a context length of 32,768 tokens.
- **Conan-1.4B Large Model Trained from Scratch**
Both the vocabulary and the 1.4B-parameter language model were trained from scratch, yielding a pre-trained model and vocabulary better tailored to the embedding scenario and delivering stronger performance.
The Conan-1.4B base model will be open-sourced, so the community can train their own embedding models on top of it.
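Cross-lingual retrieval reduces to nearest-neighbor search in a shared embedding space: Chinese queries and English documents are embedded with the same model, then ranked by cosine similarity. A minimal sketch with stand-in vectors (in practice these would come from the model):

```python
import numpy as np

def cosine_sim(query, docs):
    # Cosine similarity between a query vector and each row of a doc matrix.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return docs @ query

# Stand-in embeddings; the Chinese query and its English paraphrase
# are made near-parallel to mimic a shared cross-lingual space.
query = np.array([0.9, 0.1, 0.0])        # "今天天气很好"
docs = np.array([
    [0.88, 0.12, 0.05],                  # "The weather is nice today"
    [0.05, 0.20, 0.95],                  # "Stock prices fell sharply"
])

scores = cosine_sim(query, docs)
best = int(np.argmax(scores))            # index of the closest document
```

The real pipeline only swaps the stand-in vectors for model outputs; the ranking step is unchanged.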
## Performance
Performance of Conan-Embedding-v2 on MTEB for Chinese and English

**English**
| Embedding Task / Metric | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retr. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
|:-----------------------:|:----------------:|:------------------:|:----------------:|:--------------:|:--------------------:|:---------------:|:--------------:|:---------:|
| bge-multilingual-gemma2 | 88.08 | 54.65 | 85.97 | 59.72 | 59.24 | 83.88 | 31.20 | 69.88 |
| e5-mistral-7b-instruct | 79.89 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | **36.57** | 67.98 |
| gte-Qwen2-7B-instruct | 86.58 | 56.92 | 85.90 | **61.42** | 59.11 | 83.06 | 31.35 | 69.95 |
| stella-en-1.5B-v5 | 87.63 | 57.69 | 88.07 | 61.21 | 61.01 | 84.51 | 31.49 | 71.19 |
| bge-en-icl | 88.95 | 57.89 | 88.14 | 59.86 | 62.16 | 84.24 | 30.77 | 71.67 |
| NV-Embed-v2 | **90.37** | 58.46 | 88.67 | 60.65 | 62.65 | 84.31 | 30.70 | 72.31 |
| **Conan-embedding-v2** | 90.15 | **60.86** | **93.47** | 60.89 | **66.40** | **85.73** | 28.08 | **74.22** |
**Chinese**
| Embedding Task / Metric | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retr. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
|:-----------------------:|:--------------:|:----------------:|:---------------:|:-------------:|:------------------:|:-------------:|:---------:|
| e5-mistral-7b-instruct | 72.96 | 52.30 | 72.19 | 61.86 | 61.75 | 48.34 | 59.92 |
| gte-Qwen2-1.5B-instruct | 72.53 | 54.61 | 86.91 | 68.21 | 71.86 | 60.05 | 67.12 |
| bge-multilingual-gemma2 | 75.31 | 59.30 | 86.67 | 68.28 | 73.73 | 55.19 | 67.64 |
| gte-Qwen2-7B-instruct | 75.77 | 66.06 | 87.48 | 68.92 | 75.71 | 65.20 | 71.62 |
| xiaobu-embedding-v2 | 76.53 | 65.17 | 91.87 | 72.58 | 76.50 | 64.18 | 72.36 |
| Conan-embedding-v1 | **76.77** | 66.33 | 91.66 | 72.76 | 76.67 | 63.67 | 72.50 |
| **Conan-embedding-v2** | 76.47 | **68.84** | **92.44** | **74.41** | **78.31** | **65.48** | **74.24** |
## Model Details
### Model Structure
**Conan-Embedding-v2 Structure:**
```
SentenceTransformer(
  (0): Transformer({
        'max_seq_length': 32768,
        'do_lower_case': False
      }) with Transformer model: ConanEmbedModel,
  (1): Pooling({
        'word_embedding_dimension': 3584,
        'pooling_mode_cls_token': False,
        'pooling_mode_mean_tokens': True,
        'pooling_mode_max_tokens': False,
        'pooling_mode_mean_sqrt_len_tokens': False,
        'pooling_mode_weightedmean_tokens': False,
        'pooling_mode_lasttoken': False,
        'include_prompt': True
      }),
  (2): Dense({
        'in_features': 3584,
        'out_features': 3584,
        'bias': True,
        'activation_function': 'torch.nn.modules.linear.Identity'
      })
)
```
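The Pooling module above uses mask-aware mean pooling (`pooling_mode_mean_tokens: True`): token embeddings are averaged while padding positions are excluded via the attention mask. A minimal NumPy sketch of that step:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Mask-aware mean pooling: average only over non-padding tokens."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1e-9)  # avoid division by zero
    return summed / counts

# Toy batch: one sequence of 4 tokens, dim 3; the last token is padding.
emb = np.array([[[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [9.0, 9.0, 9.0]]])   # padding row, must not affect the mean
mask = np.array([[1, 1, 1, 0]])

pooled = mean_pool(emb, mask)         # mean of the first three rows only
```

This is only an illustration of the pooling configuration shown above, not the model's actual implementation.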
**Key Specifications of Conan-1.4B (Transformer):**
- Number of Parameters (excluding the Dense layer): 1.48B
- Vocabulary Size: 150,000
- Number of Layers: 8
- Hidden Layer Dimension: 3584
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
- Intermediate Dimension of FFN Layer: 8192
- Maximum Context Window: 32,768 Tokens
For more model details, please refer to `model/modeling_conan.py` and `config.json`, or stay tuned for the upcoming open-source release of the Conan-1.4B base model.
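The grouped-query attention (GQA) layout above means each of the 8 KV heads is shared by a group of 4 query heads, and the head dimension follows from the hidden size (3584 / 32 = 112). A shape-level sketch of that bookkeeping, using the specifications listed here:

```python
import numpy as np

hidden, n_q_heads, n_kv_heads, seq = 3584, 32, 8, 16

head_dim = hidden // n_q_heads       # 112, per the specs above
group = n_q_heads // n_kv_heads      # 4 query heads share each KV head

# Per-token projections (zeros stand in for real activations).
q = np.zeros((seq, n_q_heads, head_dim))
k = np.zeros((seq, n_kv_heads, head_dim))
v = np.zeros((seq, n_kv_heads, head_dim))

# Before attention, each KV head is broadcast to its group of query heads.
k_exp = np.repeat(k, group, axis=1)  # (seq, 32, 112)
v_exp = np.repeat(v, group, axis=1)  # (seq, 32, 112)
```

GQA keeps the KV cache 4x smaller than full multi-head attention here, since only 8 of the 32 head slots are stored.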
### Tokenizer
We trained the tokenizer on a large-scale multilingual dataset, producing a standard BBPE (byte-level byte-pair encoding) tokenizer with a vocabulary size of 150,000.
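As a toy illustration of byte-level BPE (not the actual training code): the text is first mapped to raw bytes, then training repeatedly counts adjacent id pairs and merges the most frequent pair into a new vocabulary id. One such merge step:

```python
from collections import Counter

def most_frequent_pair(ids):
    # Count adjacent id pairs and return the most common one.
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new vocabulary id.
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("abab".encode("utf-8"))  # byte-level start: [97, 98, 97, 98]
pair = most_frequent_pair(ids)      # ('a', 'b') as bytes: (97, 98)
ids = merge(ids, pair, 256)         # first new id after the 256 base bytes
```

Starting from bytes rather than characters is what makes the vocabulary lossless across scripts: any Unicode text decomposes into the 256 base byte ids.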
## Technical Report
We will soon release our technical report.
## Using Conan-Embedding-v2
Use `model/conan_api_client.py` to access our test API. A sample call is as follows:
```python
import os

from modeling_conan import ConanClient

AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")
client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)
```
This is a temporary access method; please contact us to obtain an access token.
In the future, we will provide high-performance, cost-effective, and reliable Embedding services on Tencent Cloud.
---
**About**
Created by the Tencent BAC Group. All rights reserved.