E5 Text Embeddings
Improving Text Embeddings with Large Language Models. Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei, arXiv 2024
Text Embeddings by Weakly-Supervised Contrastive Pre-training. Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022
LLM-based Models
Model | BEIR | # of layers | embedding dimension | Hugging Face
---|---|---|---|---
E5-mistral-7b-instruct | 56.9 | 32 | 4096 | intfloat/e5-mistral-7b-instruct |
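A minimal usage sketch, loosely following the `intfloat/e5-mistral-7b-instruct` model card: queries carry a one-line task instruction, and embeddings come from last-token pooling (details such as EOS-token handling are simplified here, so treat this as illustrative):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = 'intfloat/e5-mistral-7b-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)  # a GPU is strongly advised for a 7B model
model.eval()

# Queries are prefixed with a task instruction; documents are embedded as-is.
task = 'Given a web search query, retrieve relevant passages that answer the query'
texts = [
    f'Instruct: {task}\nQuery: how much protein should a female eat',
    'As a general guideline, the CDC recommends 46 grams of protein per day for adult women.',
]

# Force right padding so the last non-padding token is easy to locate.
tokenizer.padding_side = 'right'
batch = tokenizer(texts, max_length=4096, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**batch)

# Last-token pooling: the embedding is the hidden state of the final real token.
seq_lens = batch['attention_mask'].sum(dim=1) - 1
embeddings = outputs.last_hidden_state[torch.arange(len(texts)), seq_lens]
embeddings = F.normalize(embeddings.float(), p=2, dim=1)
print((embeddings[0] @ embeddings[1]) * 100)  # cosine similarity, scaled
```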
English Pre-trained Models
Model | BEIR | # of layers | embedding dimension | Hugging Face
---|---|---|---|---
E5-small-v2 | 49.0 | 12 | 384 | intfloat/e5-small-v2 |
E5-base-v2 | 50.3 | 12 | 768 | intfloat/e5-base-v2 |
E5-large-v2 | 50.6 | 24 | 1024 | intfloat/e5-large-v2 |
E5-small | 46.0 | 12 | 384 | intfloat/e5-small |
E5-base | 48.8 | 12 | 768 | intfloat/e5-base |
E5-large | 50.0 | 24 | 1024 | intfloat/e5-large |
E5-small-unsupervised | 40.8 | 12 | 384 | intfloat/e5-small-unsupervised |
E5-base-unsupervised | 42.9 | 12 | 768 | intfloat/e5-base-unsupervised |
E5-large-unsupervised | 44.2 | 24 | 1024 | intfloat/e5-large-unsupervised |
Models with the `-unsupervised` suffix are pre-trained only on unlabeled data.
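A minimal usage sketch following the Hugging Face model cards: E5 models expect a `query: ` or `passage: ` prefix on each input, and embeddings are obtained by average pooling over the last hidden states:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden, attention_mask):
    # Zero out padding positions, then take the mean over the sequence dimension.
    last_hidden = last_hidden.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained('intfloat/e5-small-v2')
model = AutoModel.from_pretrained('intfloat/e5-small-v2')

# Each input text must start with "query: " or "passage: ".
texts = [
    'query: how much protein should a female eat',
    'passage: As a general guideline, the CDC recommends 46 grams of protein per day for women.',
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**batch)
embeddings = average_pool(outputs.last_hidden_state, batch['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
print((embeddings[0] @ embeddings[1]) * 100)  # cosine similarity, scaled to ~0-100
```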
Multilingual Pre-trained Models
Model | BEIR | # of layers | embedding dimension | Hugging Face
---|---|---|---|---
multilingual-e5-small | 46.6 | 12 | 384 | intfloat/multilingual-e5-small |
multilingual-e5-base | 48.9 | 12 | 768 | intfloat/multilingual-e5-base |
multilingual-e5-large | 51.4 | 24 | 1024 | intfloat/multilingual-e5-large |
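The multilingual models follow the same `query: ` / `passage: ` convention; a brief sketch using sentence-transformers (assuming the library-level support described on the model cards):

```python
from sentence_transformers import SentenceTransformer

# The "query: " / "passage: " prefixes apply to the multilingual models as well.
model = SentenceTransformer('intfloat/multilingual-e5-small')
embeddings = model.encode(
    ['query: 南瓜的家常做法',
     'passage: 1. Peel and cube the pumpkin, steam until tender, then mash.'],
    normalize_embeddings=True,
)
print(embeddings[0] @ embeddings[1])  # cosine similarity of query and passage
```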
Install Python Package Requirements
pip install -r requirements.txt
For `e5-mistral-7b-instruct`, `transformers>=4.34` is required to load the Mistral model.
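For example:
pip install 'transformers>=4.34'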
Evaluate on the BEIR Benchmark
After installing the required Python packages, run the following command on a GPU machine:
bash scripts/eval_mteb_beir.sh intfloat/e5-small-v2
By default, the evaluation script uses all available GPUs.
Caution: the run could take quite a long time (~10 hours) due to corpus encoding.
For `intfloat/e5-mistral-7b-instruct`, it could take even longer (several days).
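To restrict the run to specific GPUs, setting `CUDA_VISIBLE_DEVICES` before the command should work, assuming the script relies on standard PyTorch device discovery:
CUDA_VISIBLE_DEVICES=0,1 bash scripts/eval_mteb_beir.sh intfloat/e5-small-v2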
Evaluate on the MTEB Benchmark
Run the following command:
bash scripts/eval_mteb_except_retrieval.sh intfloat/e5-small-v2
For multilingual models, add the `--multilingual` flag:
bash scripts/eval_mteb_except_retrieval.sh intfloat/multilingual-e5-base --multilingual
Other Resources
The data for our proposed synthetic task personalized passkey retrieval is available at https://huggingface.co/datasets/intfloat/personalized_passkey_retrieval.
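A quick way to inspect it, assuming the dataset loads with the standard `datasets` API (the split and field names are not documented here):

```python
from datasets import load_dataset

# Assumption: the dataset loads directly by its Hub ID; inspect splits/fields after loading.
ds = load_dataset('intfloat/personalized_passkey_retrieval')
print(ds)
```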
Troubleshooting
If you encounter an OOM (out-of-memory) error, try reducing the batch size.
Citation
If you find our paper or models helpful, please consider citing as follows:
@article{wang2023improving,
title={Improving Text Embeddings with Large Language Models},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2401.00368},
year={2023}
}
@article{wang2022text,
title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2212.03533},
year={2022}
}
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree, and has adopted the Microsoft Open Source Code of Conduct.