---
license: mit
language:
  - en
tags:
  - text-embeddings
  - telecom
  - domain-adaptation
  - triplet-loss
  - transformer
  - semantic-search
  - sentence-transformers
  - domain-specific
  - contrastive-learning
  - simcse
  - bio-bert
  - dont-stop-pretraining
metrics:
  - name: Telecom Triplet Score
    type: accuracy
    value: 0.938
    verified: false
  - name: Average MTEB Score
    type: accuracy
    value: 0.825
    verified: false
  - name: Average STS Score
    type: spearman
    value: 82.19
    verified: false
  - name: AllNLI Triplet Score
    type: accuracy
    value: 0.615
    verified: false
base_model:
  - Alibaba-NLP/gte-Qwen2-1.5B-instruct
model-index:
  - name: T-VEC
    results:
      - task:
          type: text-embedding
          name: Telecom Triplet Benchmark
        dataset:
          type: custom
          name: Telecom Triplet Benchmark
        metrics:
          - name: Telecom Triplet Score
            type: accuracy
            value: 0.938
            verified: false
      - task:
          type: text-embedding
          name: MTEB Benchmark
        dataset:
          type: custom
          name: MTEB Benchmark
        metrics:
          - name: Average MTEB Score
            type: accuracy
            value: 0.825
            verified: false
      - task:
          type: text-embedding
          name: STS Benchmark
        dataset:
          type: custom
          name: STS Benchmark
        metrics:
          - name: Average STS Score
            type: spearman
            value: 82.19
            verified: false
      - task:
          type: text-embedding
          name: AllNLI Triplet
        dataset:
          type: custom
          name: AllNLI Triplet
        metrics:
          - name: Triplet Score
            type: accuracy
            value: 0.615
            verified: false
extra_gated_prompt: Please answer the questions below to gain access to the model
extra_gated_fields:
  Company: text
  Full Name: text
  Email: text
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - Commercial
      - label: Other
        value: other
---

# T-VEC: A Telecom-Specific Text Embedding Model

## Overview

T-VEC (Telecom Vectorization Model) is a domain-adapted text embedding model developed by NetoAI and fine-tuned from Alibaba-NLP/gte-Qwen2-1.5B-instruct. Fine-tuned with a deep triplet-loss objective on expert-curated telecom data, T-VEC learns semantic representations tailored to telecom use cases, achieving state-of-the-art results on telecom-specific benchmarks while remaining competitive on standard ones.

## Model Details

- **Model Name:** T-VEC
- **Developer:** NetoAI
- **Base Model:** Alibaba-NLP/gte-Qwen2-1.5B-instruct
- **Parameters:** 1.5 billion
- **Embedding Dimension:** 1536
- **Max Input Tokens:** 32,000
- **Languages:** Multilingual (optimized for English)
- **License:** MIT
- **Tokenizer:** Custom telecom-specific tokenizer (open source)

## Intended Uses

- Semantic search over telecom documents (3GPP standards, vendor manuals)
- Fault log analysis for root-cause detection
- Telecom-specific chatbots and Q&A systems
- Regulatory compliance analysis and semantic auditing
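
As a concrete illustration of the semantic-search use case, the sketch below ranks a few telecom snippets against a query by cosine similarity. It assumes the `netoai/t-vec` checkpoint loads through the standard `transformers` API (as in the Usage section below) and that masked mean pooling is an acceptable way to derive sentence embeddings; both are assumptions rather than documented requirements.

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Masked mean pooling over the last hidden state (an assumed pooling strategy)."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # (batch, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(emb, dim=-1)                   # unit length, so dot product = cosine

corpus = [
    "gNB initiates an Xn-based handover when the A3 event is reported.",
    "The AMF selects an SMF based on the requested DNN and S-NSSAI.",
    "High PRB utilization on cell 42 correlates with RRC connection drops.",
]
query_emb = embed(["Which core function handles session management selection?"])
scores = query_emb @ embed(corpus).T                  # cosine similarities, shape (1, 3)
best = scores.argmax().item()
print(corpus[best], scores[0, best].item())
```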

## Training Details

- **Objective:** Triplet loss with cosine similarity
- **Dataset:** 100k+ telecom triplets curated by domain experts over 3 months
- **Layer Modification:** 338 transformer layers fine-tuned
- **Avg. L2-Norm Weight Change:** 0.7735
- **Enhancements:** Telecom-specific tokenizer and query-aware anchor strategies
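
For intuition about the training objective, here is a minimal sketch of a cosine-similarity triplet loss of the kind named above: it pushes the anchor-positive similarity above the anchor-negative similarity by a margin. The margin value and batch setup are illustrative assumptions; the card does not document the exact hyperparameters.

```python
import torch

def cosine_triplet_loss(anchor: torch.Tensor,
                        positive: torch.Tensor,
                        negative: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    """max(0, margin - cos(a, p) + cos(a, n)), averaged over the batch.

    The 0.2 margin is an illustrative choice, not the documented value.
    """
    sim_pos = torch.nn.functional.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = torch.nn.functional.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(margin - sim_pos + sim_neg, min=0.0).mean()

# Toy check with random embeddings of T-VEC's dimension (1536):
a, p, n = (torch.randn(4, 1536) for _ in range(3))
print(cosine_triplet_loss(a, p, n))
```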

## Evaluation Results

| Benchmark | Metric | Score |
|---|---|---|
| Telecom Triplet Benchmark | Accuracy | 0.9380 |
| MTEB Benchmark | Accuracy | 0.825 |
| STS Benchmark | Spearman Correlation | 82.19 |
| AllNLI Triplet | Accuracy | 0.6150 |
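
The triplet scores above are accuracies: commonly defined as the fraction of (anchor, positive, negative) triplets for which the model places the positive closer to the anchor than the negative. A minimal sketch of that computation, assuming pre-computed embeddings, looks like this:

```python
import torch

def triplet_accuracy(anchor: torch.Tensor,
                     positive: torch.Tensor,
                     negative: torch.Tensor) -> float:
    """Fraction of triplets with cos(anchor, positive) > cos(anchor, negative)."""
    sim_pos = torch.nn.functional.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = torch.nn.functional.cosine_similarity(anchor, negative, dim=-1)
    return (sim_pos > sim_neg).float().mean().item()
```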

T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks, while still retaining competitive general performance.

| Model | ArguAna | SciDocsRR | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark |
|---|---|---|---|---|---|---|---|---|
| gte-Qwen2-1.5B-instruct | 0.62335 | 0.81558 | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379 |
| T-VEC | 0.61150 | 0.83970 | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050 |
| all-MiniLM-L6-v2 | 0.50167 | 0.87119 | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032 |
| all-mpnet-base-v2 | 0.46521 | 0.88654 | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422 |
| bge-base-en-v1.5 | 0.63616 | 0.87494 | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418 |
| e5-base-v2 | 0.51604 | 0.82834 | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480 |
| jina-embeddings-v2-base-en | 0.44152 | 0.83106 | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842 |
| instructor-xl | 0.54884 | 0.79538 | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048 |
| gte-base | 0.57151 | 0.87083 | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738 |
| multilingual-e5-base | 0.47829 | 0.80392 | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201 |


## Limitations

- Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
- Large model size may hinder deployment on edge devices
- May miss recent telecom developments outside the training set

## Ethical Considerations

- Use in critical telecom systems should be validated by domain experts
- May reflect terminology biases from dominant vendors in the dataset
- Open licensing (MIT) supports transparency and community contributions

## Usage

### Installation

```bash
pip install transformers torch
```

### Load and Run

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")
model.eval()

texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)

# Mean-pool the last hidden state, masking out padding tokens so they
# do not dilute the embeddings of shorter texts.
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (batch, seq_len, 1536)
mask = inputs["attention_mask"].unsqueeze(-1)
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity of the first text against the other two.
cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
```
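
Masked mean pooling is used above because a plain `.mean(dim=1)` would average padding-token states into the embedding whenever a batch mixes texts of different lengths. The card does not specify a canonical pooling strategy, so treat this as one reasonable default; a wrapper such as `sentence-transformers` may configure pooling differently.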

## Citation

```bibtex
@article{ethiraj2025tvec,
  title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
  author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
  journal={arXiv preprint arXiv:2504.16460},
  year={2025},
  url={https://arxiv.org/abs/2504.16460}
}
```

## References

- Ethiraj, V., Menon, S., Vijay, D. "T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning." arXiv:2504.16460, 2025.
- Schroff, F., Kalenichenko, D., Philbin, J. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR, 2015.
- Hermans, A., Beyer, L., Leibe, B. "In Defense of the Triplet Loss for Person Re-Identification." arXiv:1703.07737, 2017.
- Reimers, N., Gurevych, I. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP, 2019.
- Gao, T., Yao, X., Chen, D. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv:2104.08821, 2021.
- Gururangan, S., et al. "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." ACL, 2020.
- Lee, J., Yoon, W., et al. "BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining." Bioinformatics, 2020.
- Sahu, S. K., Maheshwari, A. "Automatic Extraction of Telecom Network Events from Log Messages." IEEE ICC, 2018.
- Wang, X., Li, Y., Han, J. "Log2Vec: A Deep Embedding Model for Network Log Analysis." IEEE/IFIP DSN, 2021.

## Contact