# LASER: xSIM (multilingual similarity search)

This README shows how to calculate the xSIM (multilingual similarity search) error rate for a given language pair.

xSIM measures how well an encoder maps a bitext into a shared embedding space. Given a bitext
with source-language embeddings X and target-language embeddings Y, xSIM aligns the embeddings in
X and Y using a margin-based similarity; the error rate is the percentage of incorrect alignments.

xSIM offers three margin-based scoring options (discussed in detail [here](https://arxiv.org/pdf/1811.01136.pdf)):
- distance
- ratio
- absolute
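
As a rough illustration of how the three margins differ, the sketch below scores every source/target pair following the definitions in the linked paper: `absolute` uses the cosine similarity itself, `distance` subtracts the average similarity to the k nearest neighbours, and `ratio` divides by it. This is not the code in the `xsim` module; the function names `margin_scores` and `count_errors` and the neighbourhood size `k=4` are assumptions made for the example.

```python
import numpy as np

def margin_scores(x, y, margin="ratio", k=4):
    """Score every (source, target) pair with a margin-based similarity.

    Hypothetical helper for illustration; not part of the xsim module.
    """
    # L2-normalise so that the dot product equals the cosine similarity.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    cos = x @ y.T  # cos[i, j] = cosine(x_i, y_j)

    k = min(k, cos.shape[0], cos.shape[1])
    # Mean similarity of each sentence to its k nearest neighbours on the other side.
    x_nn = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # one value per source
    y_nn = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # one value per target
    avg = (x_nn[:, None] + y_nn[None, :]) / 2

    if margin == "absolute":
        return cos
    if margin == "distance":
        return cos - avg
    if margin == "ratio":
        return cos / avg
    raise ValueError(f"unknown margin: {margin}")

def count_errors(scores):
    """Count sources whose best-scoring target is not the one at the same index."""
    predicted = scores.argmax(axis=1)
    gold = np.arange(scores.shape[0])
    return int((predicted != gold).sum()), scores.shape[0]

# Toy data: each target is identical to its source, so no alignment errors are expected.
x = np.eye(4) + 0.1
err, nbex = count_errors(margin_scores(x, x, margin="distance", k=2))
```

The `distance` and `ratio` margins penalise sentences that are close to many candidates, which is why they tend to be preferred over raw cosine similarity for mining and alignment.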

## Example usage

### Sample script

Run the example script with `bash ./eval.sh`. It downloads a sample dataset (flores200) and a sample encoder (laser2),
then calculates the sentence embeddings and the xSIM error rate for a comma-separated list of languages.

You can also calculate xSIM for encoders hosted on [HuggingFace sentence-transformers](https://huggingface.co/sentence-transformers). For example, to use LaBSE, modify or add the following arguments in the sample script:
```
--src-encoder LaBSE
--use-hugging-face
--embedding-dimension 768
```
Note: for HuggingFace encoders there is no need to specify `--src-spm-model`.

### Python

Import xsim

```
from xsim import xSIM
```
Calculate xSIM from either NumPy float arrays (e.g. `np.float32`) or binary embedding files:
```
# A: numpy arrays x and y

err, nbex = xSIM(x, y)

# B: paths x and y to binary embedding files

fp16_flag = False     # set to True if the embeddings are saved in 16-bit precision
embedding_dim = 1024  # set to the dimension of the saved embeddings
err, nbex = xSIM(
  x, 
  y, 
  dim=embedding_dim, 
  fp16=fp16_flag
)
```
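The file layout is not described here. The sketch below assumes the binary files hold raw, headerless float32 (or float16) values in row-major order, which would explain why `dim` and `fp16` have to be supplied when loading. The helper names `save_embeddings` and `load_embeddings` are hypothetical and only illustrate the round trip.

```python
import os
import tempfile

import numpy as np

def save_embeddings(emb, path, fp16=False):
    """Write a 2-D embedding matrix as raw, headerless values in row-major order."""
    dtype = np.float16 if fp16 else np.float32
    np.ascontiguousarray(emb, dtype=dtype).tofile(path)

def load_embeddings(path, dim, fp16=False):
    """Read the raw values back and restore the (num_sentences, dim) shape."""
    dtype = np.float16 if fp16 else np.float32
    flat = np.fromfile(path, dtype=dtype)
    # The file stores no shape, so the row count is inferred from dim.
    return flat.reshape(-1, dim).astype(np.float32)

# Round trip on a tiny matrix of 2 sentences with 3 dimensions each.
emb = np.arange(6, dtype=np.float32).reshape(2, 3)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    save_embeddings(emb, path)
    restored = load_embeddings(path, dim=3)
```

Passing the wrong `dim` under this layout either fails at the reshape or silently mixes values from different sentences, so it is worth checking the row count after loading.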
Error type
```
# A: textual-based error (allows for duplicates)

tgt_text = "/path/to/target-text-file"
err, nbex = xSIM(x, y, eval_text=tgt_text)

# B: index-based error (default)

err, nbex = xSIM(x, y)
```
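The difference between the two error types matters when the target side contains repeated sentences. The sketch below shows one plausible reading of "allows for duplicates": an alignment counts as correct whenever the text of the predicted target equals the text of the reference target, even if the indices differ. The functions `index_errors` and `text_errors` are hypothetical and are not taken from the `xsim` module.

```python
def index_errors(predicted):
    """An alignment is wrong unless source i is matched to target i."""
    return sum(1 for i, j in enumerate(predicted) if i != j)

def text_errors(predicted, tgt_sentences):
    """An alignment is wrong only if the matched target text differs from the reference text."""
    return sum(
        1 for i, j in enumerate(predicted) if tgt_sentences[i] != tgt_sentences[j]
    )

# Targets 0 and 2 are duplicates, and the aligner swapped them.
tgt_sentences = ["Thank you.", "Good morning.", "Thank you."]
predicted = [2, 1, 0]  # predicted[i] is the target index chosen for source i

idx_err = index_errors(predicted)                 # counts the swap as two errors
txt_err = text_errors(predicted, tgt_sentences)   # treats the swap as correct
```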
Margin selection
```
# A: ratio (default)
err, nbex = xSIM(x, y)

# B: distance
err, nbex = xSIM(x, y, margin='distance')

# C: absolute
err, nbex = xSIM(x, y, margin='absolute')
```
Finally, to calculate the error rate as a percentage, compute `100 * err / nbex` (the number of errors divided by the total number of examples).
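
A minimal sketch of that last step, wrapped in a hypothetical helper `error_rate` that also guards against an empty evaluation set (the guard is an addition for the example, not something `xSIM` is documented to do):

```python
def error_rate(err, nbex):
    """Return the xSIM error rate as a percentage of misaligned examples."""
    if nbex == 0:
        raise ValueError("cannot compute an error rate over zero examples")
    return 100 * err / nbex

# 5 misaligned sentences out of 200 examples
rate = error_rate(5, 200)
print(f"xSIM error rate: {rate:.2f}%")
```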