---

tags:
- mteb
- sentence-transformers
- transformers
- sentence-similarity
language:
- en
- zh
license: apache-2.0
---


# Conan-Embedding-v2

## What's New?

- **Performance**
  
  Conan-Embedding-v2 now achieves state-of-the-art (SOTA) performance on the MTEB leaderboards for both Chinese and English.
  
- **Cross-lingual Retrieval between Chinese and English**
  
  Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English samples; see the retrieval sketch after this list.
  
- **Longer Context Support**
  
  Conan-Embedding-v2 now supports a context length of 32,768 tokens.
  
- **Conan 1.4B Large Model Trained from Scratch**
  
  A new vocabulary and a large language model trained from scratch, giving a pre-trained backbone and vocabulary better tailored to the embedding scenario and delivering stronger performance.
  
  The Conan-1.4B base model will be open-sourced, so the community can train their own embedding models on top of it.
  
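As a quick illustration of the cross-lingual use case, the sketch below encodes an English query against Chinese passages and ranks them by cosine similarity. This is a minimal sketch, not confirmed usage instructions: the repo id `TencentBAC/Conan-embedding-v2` and loading via `sentence-transformers` with `trust_remote_code=True` are assumptions.

```python
# Minimal cross-lingual retrieval sketch. The repo id below is an
# assumption about where the released weights live.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("TencentBAC/Conan-embedding-v2", trust_remote_code=True)

query = "What is the capital of France?"
passages = [
    "法国的首都是巴黎。",           # "The capital of France is Paris."
    "长城是中国古代的防御工程。",    # "The Great Wall is an ancient Chinese fortification."
]

# Encode both languages into the same embedding space and L2-normalize,
# so the dot product below equals cosine similarity.
query_emb = model.encode([query], normalize_embeddings=True)
passage_embs = model.encode(passages, normalize_embeddings=True)

scores = (query_emb @ passage_embs.T)[0]
for passage, score in sorted(zip(passages, scores), key=lambda p: -p[1]):
    print(f"{score:.4f}  {passage}")
```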

## Performance

Performance of Conan-Embedding-v2 on MTEB for Chinese and English. In the tables below, the number in parentheses after each column heading is the number of tasks in that category.

![MTEB Result](./src/mteb_res_v2.png)

**English**

| Model                   | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retri. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
|:-----------------------:|:----------------:|:------------------:|:----------------:|:--------------:|:--------------------:|:---------------:|:--------------:|:---------:|
| bge-multilingual-gemma2 | 88.08            | 54.65              | 85.97            | 59.72          | 59.24                | 83.88           | 31.20          | 69.88     |
| e5-mistral-7b-instruct  | 79.89            | 51.44              | 88.42            | 49.78          | 57.62                | 84.32           | **36.57**      | 67.98     |
| gte-Qwen2-7B-instruct   | 86.58            | 56.92              | 85.90            | **61.42**      | 59.11                | 83.06           | 31.35          | 69.95     |
| stella-en-1.5B-v5       | 87.63            | 57.69              | 88.07            | 61.21          | 61.01                | 84.51           | 31.49          | 71.19     |
| bge-en-icl              | 88.95            | 57.89              | 88.14            | 59.86          | 62.16                | 84.24           | 30.77          | 71.67     |
| NV-Embed-v2             | **90.37**        | 58.46              | 88.67            | 60.65          | 62.65                | 84.31           | 30.70          | 72.31     |
| **Conan-embedding-v2**  | 90.15            | **60.86**          | **93.47**        | 60.89          | **66.40**            | **85.73**       | 28.08          | **74.22** |

**Chinese**

| Model                   | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retri. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
|:-----------------------:|:--------------:|:----------------:|:---------------:|:-------------:|:------------------:|:-------------:|:---------:|
| e5-mistral-7b-instruct  | 72.96          | 52.30            | 72.19           | 61.86         | 61.75              | 48.34         | 59.92     |
| gte-Qwen2-1.5B-instruct | 72.53          | 54.61            | 86.91           | 68.21         | 71.86              | 60.05         | 67.12     |
| bge-multilingual-gemma2 | 75.31          | 59.30            | 86.67           | 68.28         | 73.73              | 55.19         | 67.64     |
| gte-Qwen2-7B-instruct   | 75.77          | 66.06            | 87.48           | 68.92         | 75.71              | 65.20         | 71.62     |
| xiaobu-embedding-v2     | 76.53          | 65.17            | 91.87           | 72.58         | 76.50              | 64.18         | 72.36     |
| Conan-embedding-v1      | **76.77**      | 66.33            | 91.66           | 72.76         | 76.67              | 63.67         | 72.50     |
| **Conan-embedding-v2**  | 76.47          | **68.84**        | **92.44**       | **74.41**     | **78.31**          | **65.48**     | **74.24** |


## Model Details

### Model Structure

**Conan-Embedding-v2 Structure:**

```
SentenceTransformer(
    (0): Transformer({
        'max_seq_length': 32768,
        'do_lower_case': False
        }) with Transformer model: ConanEmbedModel,
    (1): Pooling({
        'word_embedding_dimension': 3584,
        'pooling_mode_cls_token': False,
        'pooling_mode_mean_tokens': True,
        'pooling_mode_max_tokens': False,
        'pooling_mode_mean_sqrt_len_tokens': False,
        'pooling_mode_weightedmean_tokens': False,
        'pooling_mode_lasttoken': False,
        'include_prompt': True
        }),
    (2): Dense({
        'in_features': 3584,
        'out_features': 3584,
        'bias': True,
        'activation_function': 'torch.nn.modules.linear.Identity'
        })
)
```

**Key Specifications of Conan-1.4B (Transformer):**

- Number of Parameters (excluding the Dense layer): 1.48B
  
- Vocabulary Size: 150,000
  
- Number of Layers: 8
  
- Hidden Layer Dimension: 3584
  
- Number of Attention Heads (GQA): 32 for Q and 8 for KV
  
- Intermediate Dimension of FFN Layer: 8192
  
- Maximum Context Window: 32,768 Tokens

For more model details, please refer to `model/modeling_conan.py` and `config.json`, or stay tuned for the upcoming open-source release of the Conan-1.4B base model.
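
To make the pipeline above concrete, here is a hand-rolled sketch of the equivalent flow using `transformers` directly: mean-pool the token embeddings (the only pooling mode enabled above), which the identity-activation Dense layer then maps 3584 → 3584. The repo id is an assumption, as is the premise that the `ConanEmbedModel` backbone exposes `last_hidden_state` in the usual `transformers` fashion; the bundled `SentenceTransformer` wrapper is the supported path.

```python
# Hand-rolled equivalent of the SentenceTransformer pipeline above.
# Assumptions: the repo id, trust_remote_code loading, and a standard
# last_hidden_state output. Illustrative only; the Dense layer's
# trained weights are not applied here.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "TencentBAC/Conan-embedding-v2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(repo, trust_remote_code=True)

batch = tokenizer(["Hello!"], padding=True, truncation=True,
                  max_length=32768, return_tensors="pt")

with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq, 3584)

# Mean pooling over non-padding tokens (pooling_mode_mean_tokens=True above).
mask = batch["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# The Dense module is a 3584 -> 3584 linear layer with an Identity
# activation, so it preserves the embedding dimension.
print(sentence_embeddings.shape)  # torch.Size([1, 3584])
```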

### Tokenizer

We trained the tokenizer on a large-scale multilingual dataset to build a standard BBPE (byte-level Byte Pair Encoding) tokenizer with a vocabulary size of 150,000.
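
A quick way to inspect the tokenizer (the repo id below is an assumption, as above): byte-level BPE can encode any UTF-8 string without out-of-vocabulary tokens, so English and Chinese text round-trip through the same 150,000-entry vocabulary.

```python
# Inspect the BBPE tokenizer (repo id is an assumption, as above).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TencentBAC/Conan-embedding-v2", trust_remote_code=True
)

print(len(tokenizer))  # expected vocabulary size: 150,000

# Byte-level BPE encodes arbitrary UTF-8 text with no OOV tokens,
# so English and Chinese share one vocabulary.
for text in ["Hello, world!", "你好，世界！"]:
    ids = tokenizer.encode(text)
    print(text, "->", ids, "->", tokenizer.decode(ids))
```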

## Technical Report

We will soon release our technical report.

## Using Conan-Embedding-v2

Use `model/conan_api_client.py` to access our test API. A sample call is as follows:

```python
import os

# ConanClient is provided in model/conan_api_client.py (see above).
from conan_api_client import ConanClient

# Read the access key and secret key from the environment.
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)
```

This is a temporary access method; please contact us to obtain an access token.

In the future, we will provide high-performance, cost-effective, and reliable embedding services on Tencent Cloud.


---

**About**

Created by the Tencent BAC Group. All rights reserved.