# WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
The goal of this project is to mine for parallel sentences in the textual content of Wikipedia for all possible language pairs.
## Mined data
* 85 different languages, 1620 language pairs
* 134M parallel sentences, out of which 34M are aligned with English
* this [*table shows the number of mined parallel sentences for most language pairs*](WikiMatrix-sizes.pdf)
* the mined bitexts are stored on AWS and can be downloaded with the following command:
```bash
wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.en-fr.tsv.gz
```
Replace "en-fr" with the ISO codes of the desired language pair.
The language pair must be in alphabetical order, e.g. "de-en" and not "en-de".
The list of available bitexts and their sizes is given in the file [*list_of_bitexts.txt*](list_of_bitexts.txt).
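For scripted downloads, here is a minimal Python sketch that builds the URL with the ISO codes in the required alphabetical order; the helper name `wikimatrix_url` is ours, not part of this repository:
```python
from urllib.request import urlretrieve

# Hypothetical helper (not part of this repo): build the download URL,
# sorting the two ISO codes alphabetically as required above.
def wikimatrix_url(lang1, lang2):
    pair = '-'.join(sorted((lang1, lang2)))
    return f'https://dl.fbaipublicfiles.com/laser/WikiMatrix/v1/WikiMatrix.{pair}.tsv.gz'

# "en" and "de" are correctly mapped to the pair name "de-en"
urlretrieve(wikimatrix_url('en', 'de'), 'WikiMatrix.de-en.tsv.gz')
```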
Please do **not loop over all files**, since AWS implements some [*limitations*](https://dl.fbaipublicfiles.com/README) to avoid abuse.
Use this command if you want to download all 1620 language pairs in one tar file (but this is 65GB!):
```bash
wget https://dl.fbaipublicfiles.com/laser/WikiMatrix/WikiMatrix.v1.1620_language_pairs.tar
```
## Approach
We use LASER's bitext mining approach and encoder for 93 languages [2,3].
We do not use the inter-language links provided by Wikipedia,
but instead search over all Wikipedia articles of each language. We address the
computational challenge of mining almost 600 million sentences by using fast
indexing and similarity search with [*FAISS*](https://github.com/facebookresearch/faiss).
Prior to mining parallel sentences, we perform
sentence segmentation, deduplication and language identification.
Please see reference [1] for details.
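To make the margin criterion concrete, here is a minimal, self-contained sketch of margin-based scoring with FAISS in the spirit of [2]. Random vectors stand in for LASER embeddings, a flat index stands in for the approximate indexes needed at full scale, and all function and variable names are ours:
```python
import numpy as np
import faiss  # pip install faiss-cpu

def margin_scores(x, y, k=4):
    """For each source embedding in x, return the index of its nearest
    target embedding in y and the ratio-margin score of [2]: cos(x, y)
    divided by the average cosine to the k nearest neighbors of x and
    of y. Embeddings must be L2-normalized float32 arrays."""
    d = x.shape[1]
    idx_y = faiss.IndexFlatIP(d)  # inner product = cosine on unit vectors
    idx_y.add(y)
    idx_x = faiss.IndexFlatIP(d)
    idx_x.add(x)
    sim_xy, nn_xy = idx_y.search(x, k)  # source -> target neighbors
    sim_yx, _ = idx_x.search(y, k)      # target -> source neighbors
    avg_x = sim_xy.mean(axis=1)
    avg_y = sim_yx.mean(axis=1)
    best = nn_xy[:, 0]
    margin = 2.0 * sim_xy[:, 0] / (avg_x + avg_y[best])
    return best, margin

# Random stand-ins for LASER sentence embeddings (1024-dimensional).
rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 1024), dtype=np.float32)
y = rng.standard_normal((1200, 1024), dtype=np.float32)
faiss.normalize_L2(x)
faiss.normalize_L2(y)
best, margin = margin_scores(x, y)
pairs = [(i, b) for i, (b, m) in enumerate(zip(best, margin)) if m > 1.04]
```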
## Data extraction and threshold optimization
We provide a tool to extract parallel texts from the TSV files:
```bash
python3 extract.py \
  --tsv WikiMatrix.en-fr.tsv.gz \
  --bitext WikiMatrix.en-fr.txt \
  --src-lang en --trg-lang fr \
  --threshold 1.04
```
One can specify the threshold on the margin score.
The higher the value, the more likely the sentences are mutual translations, but the less data one will get.
**A value of 1.04 seems to be a good choice for most language pairs.** Please see the analysis in the paper [1] for
more information.
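For orientation, a rough sketch of what such a filtering step boils down to, assuming each TSV line holds the margin score, the source sentence and the target sentence in that order (extract.py is the actual, more complete tool):
```python
import gzip

# Keep only sentence pairs whose margin score is at or above the threshold.
# The column layout (score, source, target) is an assumption; see extract.py.
threshold = 1.04
with gzip.open('WikiMatrix.en-fr.tsv.gz', 'rt', encoding='utf-8') as tsv, \
     open('WikiMatrix.en-fr.en', 'w', encoding='utf-8') as src_out, \
     open('WikiMatrix.en-fr.fr', 'w', encoding='utf-8') as trg_out:
    for line in tsv:
        score, src, trg = line.rstrip('\n').split('\t')
        if float(score) >= threshold:
            src_out.write(src + '\n')
            trg_out.write(trg + '\n')
```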
## Evaluation
To assess the quality of the mined bitexts, we trained neural MT systems on all language pairs
for which we were able to mine at least 25k parallel sentences (with a margin threshold of 1.04).
We trained systems in both directions, source to target and target to source, and report BLEU scores
on the [*TED test set*](https://github.com/neulab/word-embeddings-for-nmt) proposed in [4].
This totals 1886 different NMT systems.
This [*table shows the BLEU scores for the most frequent language pairs*](WikiMatrix-bleu.pdf).
We achieve BLEU scores over 30 for several language pairs.
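As a toy illustration of how such corpus-level BLEU scores can be computed (sacrebleu is our choice here for the sketch; the exact tooling used for the paper's evaluation is not specified in this README):
```python
import sacrebleu  # pip install sacrebleu

# Made-up hypotheses and references; in the actual evaluation, hypotheses
# come from the trained NMT systems and references from the TED test set [4].
hypotheses = ['the cat sat on the mat', 'it was raining in paris']
references = [['the cat sat on the mat', 'it rained in paris']]
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f'BLEU = {bleu.score:.1f}')
```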
The goal is not to build state-of-the-art systems for each language pair, but
to get an indication of the quality of the automatically mined data. These
BLEU scores should of course be interpreted in the context of the sizes of the
mined corpora.
Obviously, we cannot exclude that the
provided data contains some wrong alignments, even though the margin is large.
Finally, we would like to point out that we ran our approach on all available
languages in Wikipedia, independently of the quality of LASER's sentence
embeddings for each one.
## License
The mined data is distributed under the Creative Commons Attribution-ShareAlike license.
Please cite reference [1] if you use this data.
## References
[1] Holger Schwenk, Vishrav Chaudhary, Shuo Sun, Hongyu Gong and Paco Guzman,
[*WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia*](https://arxiv.org/abs/1907.05791),
arXiv, July 11 2019.
[2] Mikel Artetxe and Holger Schwenk,
[*Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings*](https://arxiv.org/abs/1811.01136),
arXiv, Nov 3 2018.
[3] Mikel Artetxe and Holger Schwenk,
[*Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond*](https://arxiv.org/abs/1812.10464),
arXiv, Dec 26 2018.
[4] Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan and Graham Neubig,
[*When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?*](https://www.aclweb.org/anthology/papers/N/N18/N18-2084/),
NAACL, pages 529-535, 2018.