Tom Aarsen committed on
Commit 9eb610d · 1 Parent(s): 4ec0aa7

Revert inadvertent config, tokenizer updates

This reverts commit 978a6396c1d090b9f701a36d096a7e79d2d05d3a.

Files changed (6)
  1. .gitattributes +0 -1
  2. README.md +84 -84
  3. config.json +32 -35
  4. special_tokens_map.json +1 -51
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +1 -56
.gitattributes CHANGED
@@ -16,4 +16,3 @@
  *.pth filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  model.safetensors filter=lfs diff=lfs merge=lfs -text
- tokenizer.json filter=lfs diff=lfs merge=lfs -text
 
README.md CHANGED
@@ -1,85 +1,85 @@
- ---
- license: apache-2.0
- datasets:
- - sentence-transformers/msmarco
- language:
- - en
- - de
- base_model:
- - microsoft/Multilingual-MiniLM-L12-H384
- pipeline_tag: text-ranking
- library_name: sentence-transformers
- tags:
- - transformers
- ---
- # Cross-Encoder for MS MARCO - EN-DE
-
- This is a cross-lingual Cross-Encoder model for EN-DE that can be used for passage re-ranking. It was trained on the [MS Marco Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) task.
-
- The model can be used for Information Retrieval: See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
-
- The training code is available in this repository, see `train_script.py`.
-
-
- ## Usage with SentenceTransformers
-
- When you have [SentenceTransformers](https://www.sbert.net/) installed, you can use the model like this:
- ```python
- from sentence_transformers import CrossEncoder
-
- model = CrossEncoder('model_name', max_length=512)
-
- query = 'How many people live in Berlin?'
- docs = ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.']
- pairs = [(query, doc) for doc in docs]
- scores = model.predict(pairs)
- ```
-
-
- ## Usage with Transformers
- With the transformers library, you can use the model like this:
-
- ```python
- from transformers import AutoTokenizer, AutoModelForSequenceClassification
- import torch
-
- model = AutoModelForSequenceClassification.from_pretrained('model_name')
- tokenizer = AutoTokenizer.from_pretrained('model_name')
-
- features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")
-
- model.eval()
- with torch.no_grad():
-     scores = model(**features).logits
-     print(scores)
- ```
-
-
-
-
- ## Performance
- The performance was evaluated on three datasets:
- - **TREC-DL19 EN-EN**: The original [TREC 2019 Deep Learning Track](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html): Given an English query and 1000 documents (retrieved by BM25 lexical search), rank documents with according to their relevance. We compute NDCG@10. BM25 achieves a score of 45.46, a perfect re-ranker can achieve a score of 95.47.
- - **TREC-DL19 DE-EN**: The English queries of TREC-DL19 have been translated by a German native speaker to German. We rank the German queries versus the English passages from the original TREC-DL19 setup. We compute NDCG@10.
- - **GermanDPR DE-DE**: The [GermanDPR](https://www.deepset.ai/germanquad) dataset provides German queries and German passages from Wikipedia. We indexed the 2.8 Million paragraphs from German Wikipedia and retrieved for each query the top 100 most relevant passages using BM25 lexical search with Elasticsearch. We compute MRR@10. BM25 achieves a score of 35.85, a perfect re-ranker can achieve a score of 76.27.
-
- We also check the performance of bi-encoders using the same evaluation: The retrieved documents from BM25 lexical search are re-ranked using query & passage embeddings with cosine-similarity. Bi-Encoders can also be used for end-to-end semantic search.
-
-
- | Model-Name | TREC-DL19 EN-EN | TREC-DL19 DE-EN | GermanDPR DE-DE | Docs / Sec |
- | ------------- |:-------------:| :-----: | :---: | :----: |
- | BM25 | 45.46 | - | 35.85 | -|
- | **Cross-Encoder Re-Rankers** | | | |
- | [cross-encoder/msmarco-MiniLM-L6-en-de-v1](https://huggingface.co/cross-encoder/msmarco-MiniLM-L6-en-de-v1) | 72.43 | 65.53 | 46.77 | 1600 |
- | [cross-encoder/msmarco-MiniLM-L12-en-de-v1](https://huggingface.co/cross-encoder/msmarco-MiniLM-L12-en-de-v1) | 72.94 | 66.07 | 49.91 | 900 |
- | [svalabs/cross-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/cross-electra-ms-marco-german-uncased) (DE only) | - | - | 53.67 | 260 |
- | [deepset/gbert-base-germandpr-reranking](https://huggingface.co/deepset/gbert-base-germandpr-reranking) (DE only) | - | - | 53.59 | 260 |
- | **Bi-Encoders (re-ranking)** | | | |
- | [sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned](https://huggingface.co/sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned) | 63.38 | 58.28 | 37.88 | 940 |
- | [sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch](https://huggingface.co/sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch) | 65.51 | 58.69 | 38.32 | 940 |
- | [svalabs/bi-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased) (DE only) | - | - | 34.31 | 450 |
- | [deepset/gbert-base-germandpr-question_encoder](https://huggingface.co/deepset/gbert-base-germandpr-question_encoder) (DE only) | - | - | 42.55 | 450 |
-
-
-
+ ---
+ license: apache-2.0
+ datasets:
+ - sentence-transformers/msmarco
+ language:
+ - en
+ - de
+ base_model:
+ - microsoft/Multilingual-MiniLM-L12-H384
+ pipeline_tag: text-ranking
+ library_name: sentence-transformers
+ tags:
+ - transformers
+ ---
+ # Cross-Encoder for MS MARCO - EN-DE
+
+ This is a cross-lingual Cross-Encoder model for EN-DE that can be used for passage re-ranking. It was trained on the [MS Marco Passage Ranking](https://github.com/microsoft/MSMARCO-Passage-Ranking) task.
+
+ The model can be used for Information Retrieval: See [SBERT.net Retrieve & Re-rank](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
+
+ The training code is available in this repository, see `train_script.py`.
+
+
+ ## Usage with SentenceTransformers
+
+ When you have [SentenceTransformers](https://www.sbert.net/) installed, you can use the model like this:
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ model = CrossEncoder('model_name', max_length=512)
+
+ query = 'How many people live in Berlin?'
+ docs = ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.']
+ pairs = [(query, doc) for doc in docs]
+ scores = model.predict(pairs)
+ ```
+
+
+ ## Usage with Transformers
+ With the transformers library, you can use the model like this:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model = AutoModelForSequenceClassification.from_pretrained('model_name')
+ tokenizer = AutoTokenizer.from_pretrained('model_name')
+
+ features = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")
+
+ model.eval()
+ with torch.no_grad():
+     scores = model(**features).logits
+     print(scores)
+ ```
+
+
+
+
+ ## Performance
+ The performance was evaluated on three datasets:
+ - **TREC-DL19 EN-EN**: The original [TREC 2019 Deep Learning Track](https://microsoft.github.io/msmarco/TREC-Deep-Learning-2019.html): Given an English query and 1000 documents (retrieved by BM25 lexical search), rank documents with according to their relevance. We compute NDCG@10. BM25 achieves a score of 45.46, a perfect re-ranker can achieve a score of 95.47.
+ - **TREC-DL19 DE-EN**: The English queries of TREC-DL19 have been translated by a German native speaker to German. We rank the German queries versus the English passages from the original TREC-DL19 setup. We compute NDCG@10.
+ - **GermanDPR DE-DE**: The [GermanDPR](https://www.deepset.ai/germanquad) dataset provides German queries and German passages from Wikipedia. We indexed the 2.8 Million paragraphs from German Wikipedia and retrieved for each query the top 100 most relevant passages using BM25 lexical search with Elasticsearch. We compute MRR@10. BM25 achieves a score of 35.85, a perfect re-ranker can achieve a score of 76.27.
+
+ We also check the performance of bi-encoders using the same evaluation: The retrieved documents from BM25 lexical search are re-ranked using query & passage embeddings with cosine-similarity. Bi-Encoders can also be used for end-to-end semantic search.
+
+
+ | Model-Name | TREC-DL19 EN-EN | TREC-DL19 DE-EN | GermanDPR DE-DE | Docs / Sec |
+ | ------------- |:-------------:| :-----: | :---: | :----: |
+ | BM25 | 45.46 | - | 35.85 | -|
+ | **Cross-Encoder Re-Rankers** | | | |
+ | [cross-encoder/msmarco-MiniLM-L6-en-de-v1](https://huggingface.co/cross-encoder/msmarco-MiniLM-L6-en-de-v1) | 72.43 | 65.53 | 46.77 | 1600 |
+ | [cross-encoder/msmarco-MiniLM-L12-en-de-v1](https://huggingface.co/cross-encoder/msmarco-MiniLM-L12-en-de-v1) | 72.94 | 66.07 | 49.91 | 900 |
+ | [svalabs/cross-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/cross-electra-ms-marco-german-uncased) (DE only) | - | - | 53.67 | 260 |
+ | [deepset/gbert-base-germandpr-reranking](https://huggingface.co/deepset/gbert-base-germandpr-reranking) (DE only) | - | - | 53.59 | 260 |
+ | **Bi-Encoders (re-ranking)** | | | |
+ | [sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned](https://huggingface.co/sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-lng-aligned) | 63.38 | 58.28 | 37.88 | 940 |
+ | [sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch](https://huggingface.co/sentence-transformers/msmarco-distilbert-multilingual-en-de-v2-tmp-trained-scratch) | 65.51 | 58.69 | 38.32 | 940 |
+ | [svalabs/bi-electra-ms-marco-german-uncased](https://huggingface.co/svalabs/bi-electra-ms-marco-german-uncased) (DE only) | - | - | 34.31 | 450 |
+ | [deepset/gbert-base-germandpr-question_encoder](https://huggingface.co/deepset/gbert-base-germandpr-question_encoder) (DE only) | - | - | 42.55 | 450 |
+
+
+
  Note: Docs / Sec gives the number of (query, document) pairs we can re-rank within a second on a V100 GPU.
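
The Performance section of the README above describes re-ranking BM25 candidates by model score and reports MRR@10 for GermanDPR. As a reminder of what that involves, here is a minimal sketch in plain Python. The helper names `rerank` and `mrr_at_k`, and the toy scores and labels, are illustrative assumptions; this is not the evaluation script behind the reported numbers.

```python
def rerank(docs, scores):
    """Return docs sorted by descending score (higher score = more relevant)."""
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [docs[i] for i in order]


def mrr_at_k(ranked_relevance, k=10):
    """Reciprocal rank of the first relevant item in the top k, or 0.0 if none."""
    for rank, is_relevant in enumerate(ranked_relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


# Toy data: made-up scores standing in for the output of model.predict(pairs).
docs = ["doc_a", "doc_b", "doc_c"]
scores = [0.1, 2.3, -1.0]
relevant = {"doc_b"}  # hypothetical gold label for this query

ranked = rerank(docs, scores)
labels = [doc in relevant for doc in ranked]
print(ranked, mrr_at_k(labels))
```

Averaging this per-query value over all queries gives a dataset-level MRR@10; the README appears to report it on a 0 to 100 scale.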
config.json CHANGED
@@ -1,35 +1,32 @@
- {
-   "architectures": [
-     "BertForSequenceClassification"
-   ],
-   "attention_probs_dropout_prob": 0.1,
-   "classifier_dropout": null,
-   "gradient_checkpointing": false,
-   "hidden_act": "gelu",
-   "hidden_dropout_prob": 0.1,
-   "hidden_size": 384,
-   "id2label": {
-     "0": "LABEL_0"
-   },
-   "initializer_range": 0.02,
-   "intermediate_size": 1536,
-   "label2id": {
-     "LABEL_0": 0
-   },
-   "layer_norm_eps": 1e-12,
-   "max_position_embeddings": 512,
-   "model_type": "bert",
-   "num_attention_heads": 12,
-   "num_hidden_layers": 6,
-   "pad_token_id": 0,
-   "position_embedding_type": "absolute",
-   "sentence_transformers": {
-     "activation_fn": "torch.nn.modules.linear.Identity",
-     "version": "4.1.0.dev0"
-   },
-   "tokenizer_class": "XLMRobertaTokenizer",
-   "transformers_version": "4.52.0.dev0",
-   "type_vocab_size": 2,
-   "use_cache": true,
-   "vocab_size": 250037
- }
+ {
+   "_name_or_path": "microsoft/Multilingual-MiniLM-L12-H384",
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 384,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 1536,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 6,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "transformers_version": "4.6.1",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 250037,
+   "sbert_ce_default_activation_function": "torch.nn.modules.linear.Identity"
+ }
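
The two versions of `config.json` above store the cross-encoder's activation function under different keys: the removed version nests it as `sentence_transformers.activation_fn`, while the restored version uses the top-level `sbert_ce_default_activation_function`. A minimal sketch of reading either layout follows. `get_activation_fn_name` is a hypothetical helper written for illustration, not the lookup that sentence-transformers itself performs.

```python
def get_activation_fn_name(config):
    """Return the dotted activation-function path from either config layout.

    Checks the nested "sentence_transformers" block first, then the
    top-level legacy key; returns None if neither is present.
    """
    nested = config.get("sentence_transformers")
    if isinstance(nested, dict) and "activation_fn" in nested:
        return nested["activation_fn"]
    return config.get("sbert_ce_default_activation_function")


# Abbreviated copies of the two layouts shown in the diff.
removed_layout = {
    "sentence_transformers": {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "version": "4.1.0.dev0",
    }
}
restored_layout = {
    "sbert_ce_default_activation_function": "torch.nn.modules.linear.Identity"
}

print(get_activation_fn_name(removed_layout))
print(get_activation_fn_name(restored_layout))
```

Both layouts name the same function, `torch.nn.modules.linear.Identity`, so on this setting the revert changes where the value lives rather than the value itself.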
 
special_tokens_map.json CHANGED
@@ -1,51 +1 @@
- {
-   "bos_token": {
-     "content": "<s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "cls_token": {
-     "content": "<s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "eos_token": {
-     "content": "</s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "mask_token": {
-     "content": "<mask>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "pad_token": {
-     "content": "<pad>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "sep_token": {
-     "content": "</s>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   },
-   "unk_token": {
-     "content": "<unk>",
-     "lstrip": false,
-     "normalized": false,
-     "rstrip": false,
-     "single_word": false
-   }
- }
+ {"bos_token": "<s>", "eos_token": "</s>", "unk_token": "<unk>", "sep_token": "</s>", "pad_token": "<pad>", "cls_token": "<s>", "mask_token": "<mask>"}
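
The removed `special_tokens_map.json` writes each token as an object with a `content` field plus stripping flags, while the restored file writes each token as a bare string. A small sketch of reducing both forms to the same name-to-string mapping follows. `flatten_special_tokens` is a hypothetical helper for illustration, and the dictionaries are abbreviated to two tokens each.

```python
def flatten_special_tokens(tokens_map):
    """Map each token name to its literal string.

    Accepts either the verbose form ({"content": "<s>", ...}) or the
    compact form ("<s>"). The lstrip/rstrip/normalized/single_word flags
    of the verbose form are discarded.
    """
    flat = {}
    for name, entry in tokens_map.items():
        flat[name] = entry["content"] if isinstance(entry, dict) else entry
    return flat


# Abbreviated examples of the two forms shown in the diff.
verbose = {
    "bos_token": {"content": "<s>", "lstrip": False, "normalized": False,
                  "rstrip": False, "single_word": False},
    "eos_token": {"content": "</s>", "lstrip": False, "normalized": False,
                  "rstrip": False, "single_word": False},
}
compact = {"bos_token": "<s>", "eos_token": "</s>"}

print(flatten_special_tokens(verbose) == flatten_special_tokens(compact))
```

The comparison only covers the token strings; whether the dropped flags matter for a given tokenizer version is not something this sketch checks.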
 
tokenizer.json CHANGED
The diff for this file is too large to render.
 
tokenizer_config.json CHANGED
@@ -1,56 +1 @@
- {
-   "added_tokens_decoder": {
-     "0": {
-       "content": "<s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "1": {
-       "content": "<pad>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "2": {
-       "content": "</s>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "3": {
-       "content": "<unk>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     },
-     "250001": {
-       "content": "<mask>",
-       "lstrip": false,
-       "normalized": false,
-       "rstrip": false,
-       "single_word": false,
-       "special": true
-     }
-   },
-   "bos_token": "<s>",
-   "clean_up_tokenization_spaces": false,
-   "cls_token": "<s>",
-   "eos_token": "</s>",
-   "extra_special_tokens": {},
-   "mask_token": "<mask>",
-   "model_max_length": 512,
-   "pad_token": "<pad>",
-   "sep_token": "</s>",
-   "sp_model_kwargs": {},
-   "tokenizer_class": "XLMRobertaTokenizer",
-   "unk_token": "<unk>"
- }
+ {"bos_token": "<s>", "eos_token": "</s>", "sep_token": "</s>", "cls_token": "<s>", "unk_token": "<unk>", "pad_token": "<pad>", "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "special_tokens_map_file": "/root/.cache/huggingface/transformers/8ed73a1ab9ef4e90a9451497bf96cfc38d34354352838a143f2dda1c81aed5ca.0dc5b1041f62041ebbd23b1297f2f573769d5c97d8b7c28180ec86b8f6185aa8", "name_or_path": "microsoft/Multilingual-MiniLM-L12-H384", "sp_model_kwargs": {}}
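
To see at a glance which top-level settings this revert drops from and restores to `tokenizer_config.json`, the key sets of the two versions can be compared. The sketch below is an illustration only: `key_changes` is a hypothetical helper, and the values are replaced by `None` because only the key names from the diff above are used.

```python
def key_changes(old, new):
    """Return (keys only in old, keys only in new), each sorted by name."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    return removed, added


# Top-level keys of the removed ("-") and restored ("+") versions; values omitted.
old_config = dict.fromkeys([
    "added_tokens_decoder", "bos_token", "clean_up_tokenization_spaces",
    "cls_token", "eos_token", "extra_special_tokens", "mask_token",
    "model_max_length", "pad_token", "sep_token", "sp_model_kwargs",
    "tokenizer_class", "unk_token",
])
new_config = dict.fromkeys([
    "bos_token", "eos_token", "sep_token", "cls_token", "unk_token",
    "pad_token", "mask_token", "special_tokens_map_file", "name_or_path",
    "sp_model_kwargs",
])

removed, added = key_changes(old_config, new_config)
print("dropped by the revert:", removed)
print("restored by the revert:", added)
```

Keys present in both versions can still differ in value: `mask_token` is a plain string in the removed version and an `AddedToken` object with `lstrip` set to true in the restored one.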