Model outputs 768-dim embeddings instead of the documented 1024

#1 opened by Bhanu3

Hello,

I'm trying out the JobBERT-v2 model via the sentence-transformers library. According to the documentation, this model is supposed to output 1024-dimensional embeddings. However, during inference I'm getting 768-dimensional embeddings.
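For reference, here is roughly what I'm running (a minimal sketch; the job titles are just placeholders):

from sentence_transformers import SentenceTransformer

# Load the model and encode a couple of placeholder job titles
model = SentenceTransformer("jensjorisdecorte/JobBERT-v2")
embeddings = model.encode(["software engineer", "data scientist"])
print(embeddings.shape)  # prints (2, 768) on my setup, not the expected (2, 1024)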

I suspect that the Asym layer is primarily designed for training scenarios where embeddings like "anchor" and "positive" are compared or contrasted. During inference, using such layers without the corresponding training dynamics might not yield the expected transformations.

Is the Asym layer intended only for training purposes in the JobBERT-v2 model, or am I doing something wrong?

Model Name: jensjorisdecorte/JobBERT-v2
Library Versions:
sentence-transformers: 3.1.0
transformers: 4.44.2
torch: 2.4.1+cu118
Python Version: 3.8
Device: CUDA

Thanks,

Hi @Bhanu3 ,

By default, inputs are not routed through the Asym layer at inference time; this is due to how the sentence-transformers package is designed. To make sure job titles are passed through the correct branch of the Asym layer, please follow the code example in the README:

import torch
import numpy as np
from tqdm.auto import tqdm
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import batch_to_device, cos_sim

# Load the model
model = SentenceTransformer("TechWolf/JobBERT-v2")

def encode_batch(jobbert_model, texts):
    features = jobbert_model.tokenize(texts)
    features = batch_to_device(features, jobbert_model.device)
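    # Route these inputs through the "anchor" branch of the Asym layer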
    features["text_keys"] = ["anchor"]
    with torch.no_grad():
        out_features = jobbert_model.forward(features)
    return out_features["sentence_embedding"].cpu().numpy()

def encode(jobbert_model, texts, batch_size: int = 8):
    # Sort texts by length and keep track of original indices
    sorted_indices = np.argsort([len(text) for text in texts])
    sorted_texts = [texts[i] for i in sorted_indices]
    
    embeddings = []
    
    # Encode in batches
    for i in tqdm(range(0, len(sorted_texts), batch_size)):
        batch = sorted_texts[i:i+batch_size]
        embeddings.append(encode_batch(jobbert_model, batch))
    
    # Concatenate embeddings and reorder to original indices
    sorted_embeddings = np.concatenate(embeddings)
    original_order = np.argsort(sorted_indices)
    return sorted_embeddings[original_order]

# Example usage: replace [...] with your list of job titles
embeddings = encode(model, [...])

# Calculate cosine similarity matrix
similarities = cos_sim(embeddings, embeddings)
print(similarities)
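
With this, the returned embeddings should have the dimensionality documented on the model card (1024 rather than 768). A quick sanity check, using a couple of placeholder job titles:

# Sanity check with placeholder job titles
sample_embeddings = encode(model, ["software engineer", "machine learning engineer"])
print(sample_embeddings.shape)  # expected: (2, 1024), i.e. the documented dimensionality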

Got it! Thank you very much for the clarification.
