---
language:
- ja
library_name: transformers
license: apache-2.0
pipeline_tag: sentence-similarity
tags:
- feature-extraction
- sentence-similarity
- transformers
---

# PLaMo-Embedding-1B

[日本語版のREADME/Japanese README](README_ja.md)

## Model Overview

PLaMo-Embedding-1B is a Japanese text embedding model developed by Preferred Networks, Inc.

It converts Japanese text into numerical vectors and can be used for a wide range of applications, including information retrieval, text classification, and clustering.

As of early April 2025, it achieved top-class scores on [JMTEB](https://github.com/sbintuitions/JMTEB), a benchmark for Japanese text embedding, with particularly strong performance on retrieval tasks.

PLaMo-Embedding-1B is released under the [Apache v2.0](https://www.apache.org/licenses/LICENSE-2.0) license, and you can use it freely, including for commercial purposes.

For technical details, please refer to the following Tech Blog post (Ja): https://tech.preferred.jp/ja/blog/plamo-embedding-1b/

## Usage

### Requirements

```
sentencepiece
torch
transformers
```

### Sample Code

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# You can download the model from the Hugging Face Hub 🤗 as follows:
tokenizer = AutoTokenizer.from_pretrained("pfnet/plamo-embedding-1b", trust_remote_code=True)
model = AutoModel.from_pretrained("pfnet/plamo-embedding-1b", trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

query = "PLaMo-Embedding-1Bとは何ですか?"
documents = [
    "PLaMo-Embedding-1Bは、Preferred Networks, Inc. によって開発された日本語テキスト埋め込みモデルです。",
    "最近は随分と暖かくなりましたね。"
]

with torch.inference_mode():
    # For embedding query texts in information retrieval, use the `encode_query` method.
    # You also need to pass the `tokenizer`.
    query_embedding = model.encode_query(query, tokenizer)
    # For all other texts, and for applications other than information retrieval,
    # use the `encode_document` method.
    document_embeddings = model.encode_document(documents, tokenizer)

# Vectors obtained from similar sentences have high cosine similarity, while vectors
# from dissimilar sentences have low similarity. This property can be used for
# applications such as information retrieval.
similarities = F.cosine_similarity(query_embedding, document_embeddings)
print(similarities)
# tensor([0.8812, 0.5533])
```
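
For retrieval-style use, the similarity scores can be sorted to rank documents against a query. The following is a small sketch, not part of the official sample, that reuses the `similarities` and `documents` variables from the code above:

```python
# Rank the example documents by cosine similarity to the query, reusing
# `similarities` and `documents` from the sample above.
ranking = torch.argsort(similarities, descending=True)
for rank, idx in enumerate(ranking.tolist(), start=1):
    print(f"{rank}. score={similarities[idx].item():.4f}  {documents[idx]}")
```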

Note: For `encode_document` and `encode_query`, texts exceeding the model's maximum context length of 4096 tokens will be truncated. Be especially aware that for `encode_query`, a prefix is added internally, so the effective maximum context length is slightly shorter.
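
If you want to check ahead of time whether an input will be truncated, you can count tokens with the tokenizer yourself. This is a minimal sketch, not part of the official API; it does not account for the internal query prefix, so treat 4096 as an upper bound:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pfnet/plamo-embedding-1b", trust_remote_code=True)

MAX_CONTEXT_LENGTH = 4096  # the model's maximum context length

def will_be_truncated(text: str) -> bool:
    # Tokenize the text and compare its length against the context limit.
    # The prefix that `encode_query` adds internally is not included here.
    return len(tokenizer(text)["input_ids"]) > MAX_CONTEXT_LENGTH

print(will_be_truncated("PLaMo-Embedding-1Bとは何ですか?"))  # False for short inputs
```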

## Benchmarks

We conducted a performance evaluation using [JMTEB](https://github.com/sbintuitions/JMTEB), a benchmark for Japanese text embedding.

| Model | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|:------|:-----|:----------|:----|:---------------|:----------|:-----------|:-------------------|
| [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
| [pkshatech/GLuCoSE-base-ja-v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2) | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
| [OpenAI/text-embedding-3-large](https://openai.com/index/new-embedding-models-and-api-updates/) | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
| [cl-nagoya/ruri-large-v2](https://huggingface.co/cl-nagoya/ruri-large-v2) | 74.55 | 76.34 | 83.17 | 77.18 | 93.21 | 52.14 | 62.27 |
| [Sarashina-Embedding-v1-1B](https://huggingface.co/sbintuitions/sarashina-embedding-v1-1b) | 75.50 | 77.61 | 82.71 | **78.37** | **93.74** | **53.86** | 62.00 |
| [**PLaMo-Embedding-1B**](https://huggingface.co/pfnet/plamo-embedding-1b) (This model) (*) | **76.10** | **79.94** | **83.14** | 77.20 | 93.57 | 53.47 | 62.37 |

(*): Measured with a context length of 1024. Although the model supports a context length of up to 4096, we measured at 1024 because the maximum context length used during training was 1024. However, evaluating at 4096 is known not to significantly affect the average score (Ref: [Tech Blog (Ja)](https://tech.preferred.jp/ja/blog/plamo-embedding-1b/)).

## Model Details

- Model Size: 1B parameters
- Maximum Context Length: 4096 tokens
- Embedding Dimensionality: 2048
- Similarity Function: cosine similarity (see the sketch after this list)
- Developer: Preferred Networks, Inc.
- Language: Japanese
- License: [Apache v2.0](https://www.apache.org/licenses/LICENSE-2.0)
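
Because the similarity function is cosine similarity, document embeddings can be L2-normalized once and then scored against a query with a single matrix product. This is a minimal sketch, assuming `query_embedding` of shape (1, 2048) and `document_embeddings` of shape (num_docs, 2048), as produced in the Usage sample above:

```python
import torch.nn.functional as F

# L2-normalize so that a plain dot product equals cosine similarity.
doc_matrix = F.normalize(document_embeddings, p=2, dim=-1)  # assumed shape: (num_docs, 2048)
query_vec = F.normalize(query_embedding, p=2, dim=-1)       # assumed shape: (1, 2048)

# One matrix product scores the query against every document at once.
scores = query_vec @ doc_matrix.T                            # shape: (1, num_docs)
```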

## License

PLaMo-Embedding-1B is released under the [Apache v2.0](https://www.apache.org/licenses/LICENSE-2.0) license, and you can use it freely, including for commercial purposes.

## How to cite

```
@online{PLaMoEmbedding1B,
    author  = {Preferred Networks, Inc},
    title   = {PLaMo-Embedding-1B},
    year    = {2025},
    url     = {https://huggingface.co/pfnet/plamo-embedding-1b},
    urldate = {2025-04-17}
}
```