# LASER: calculation of sentence embeddings

Tool to calculate sentence embeddings for an arbitrary text file:
```
bash ./embed.sh INPUT-FILE OUTPUT-FILE [LANGUAGE]
```

The input is first tokenized, and sentence embeddings are then generated. If a `language` is specified, 
`embed.sh` looks for a language-specific LASER3 encoder using the format `{model_dir}/laser3-{language}.{version}.pt`. 
Otherwise it defaults to LASER2, which covers the same 93 languages as [the original LASER encoder](https://arxiv.org/pdf/1812.10464.pdf).
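
The encoder selection described above can be sketched in Python. This is only an illustration of the naming pattern, not the code in `embed.sh`: the function name `resolve_encoder`, the fallback file name `laser2.pt`, and the choice to fall back when the LASER3 file is missing are all assumptions.
```python
from pathlib import Path
from typing import Optional


def resolve_encoder(model_dir: str, language: Optional[str] = None, version: str = "1") -> Path:
    """Return the encoder checkpoint this sketch would pick (hypothetical helper)."""
    # Assumed file name for the LASER2 checkpoint; check the contents of your model_dir.
    fallback = Path(model_dir) / "laser2.pt"
    if language:
        # Pattern quoted in this README: {model_dir}/laser3-{language}.{version}.pt
        candidate = Path(model_dir) / f"laser3-{language}.{version}.pt"
        if candidate.is_file():
            return candidate
    # No language given, or no matching LASER3 file found (assumed behaviour).
    return fallback
```
For example, `resolve_encoder("/path/to/models", "wol_Latn")` returns the Wolof LASER3 path only if that file exists in the directory.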

**NOTE:** Please set the model location (`model_dir` in `embed.sh`) before running. We recommend downloading the models from the NLLB 
release (see [here](/nllb/README.md)). Optionally, you can also select the model version number for downloaded LASER3 models. This currently defaults to `1` (initial release).

## Output format

The embeddings are stored in float32 matrices in raw binary format.
They can be read in Python by:
```
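import numpy as np
dim = 1024  # embedding dimension
# Read the whole file as a flat float32 array
X = np.fromfile("my_embeddings.bin", dtype=np.float32, count=-1)
# Reshape in place to one row per sentence
X.resize(X.shape[0] // dim, dim)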
```
`X` is an N x 1024 matrix, where N is the number of lines in the input text file.
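
Because the raw binary format carries no header, a truncated or mismatched file loads without complaint. A minimal sketch of a loader that checks the file size first is shown below; the helper name `load_embeddings` is hypothetical and not part of LASER.
```python
import os

import numpy as np


def load_embeddings(path: str, dim: int = 1024) -> np.ndarray:
    """Load a raw float32 embedding file as an (N, dim) matrix (hypothetical helper)."""
    row_bytes = dim * np.dtype(np.float32).itemsize  # bytes per sentence embedding
    size = os.path.getsize(path)
    if size % row_bytes != 0:
        raise ValueError(
            f"{path}: {size} bytes is not a multiple of {row_bytes} (dim={dim})"
        )
    rows = size // row_bytes
    return np.fromfile(path, dtype=np.float32).reshape(rows, dim)
```
The number of rows in the returned matrix can then be compared with the line count of the input text file as a sanity check.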
        
## Examples

In order to encode an input text in any of the 93 languages supported by LASER2 (e.g. Afrikaans, English, French):
```
./embed.sh input_file output_file
```

To use a language-specific encoder (if available), for example for Wolof, Hausa, or Irish:
```
./embed.sh input_file output_file wol_Latn
```
```
./embed.sh input_file output_file hau_Latn
```
```
./embed.sh input_file output_file gle_Latn
```