---
license: mit
language:
  - en
tags:
  - text-embeddings
  - telecom
  - domain-adaptation
  - triplet-loss
  - transformer
  - semantic-search
  - sentence-transformers
  - domain-specific
  - contrastive-learning
  - simcse
  - bio-bert
  - dont-stop-pretraining
metrics:
  - name: Telecom Triplet Score
    type: accuracy
    value: 0.938
    verified: false
  - name: Average MTEB Score
    type: accuracy
    value: 0.825
    verified: false
  - name: Average STS Score
    type: spearman
    value: 82.19
    verified: false
  - name: AllNLI Triplet Score
    type: accuracy
    value: 0.615
    verified: false
base_model:
  - Alibaba-NLP/gte-Qwen2-1.5B-instruct
model-index:
  - name: T-VEC
    results:
      - task:
          type: text-embedding
          name: Telecom Triplet Benchmark
        dataset:
          type: custom
          name: Telecom Triplet Benchmark
        metrics:
          - name: Telecom Triplet Score
            type: accuracy
            value: 0.938
            verified: false
      - task:
          type: text-embedding
          name: MTEB Benchmark
        dataset:
          type: custom
          name: MTEB Benchmark
        metrics:
          - name: Average MTEB Score
            type: accuracy
            value: 0.825
            verified: false
      - task:
          type: text-embedding
          name: STS Benchmark
        dataset:
          type: custom
          name: STS Benchmark
        metrics:
          - name: Average STS Score
            type: spearman
            value: 82.19
            verified: false
      - task:
          type: text-embedding
          name: AllNLI Triplet
        dataset:
          type: custom
          name: AllNLI Triplet
        metrics:
          - name: Triplet Score
            type: accuracy
            value: 0.615
            verified: false
extra_gated_prompt: Please answer the questions below to gain access to the model
extra_gated_fields:
  Company: text
  Full Name: text
  Email: text
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - Commercial
      - label: Other
        value: other
---

# T-VEC: A Telecom-Specific Text Embedding Model

## Overview

T-VEC (Telecom Vectorization Model) is a domain-adapted text embedding model developed by NetoAI and fine-tuned from Alibaba-NLP/gte-Qwen2-1.5B-instruct. Fine-tuned with a deep triplet-loss objective on expert-curated telecom data, T-VEC learns semantic representations tailored to telecom use cases, achieving state-of-the-art results on telecom-specific benchmarks while remaining competitive on standard ones.

## Model Details

- **Model Name:** T-VEC
- **Developer:** NetoAI
- **Base Model:** Alibaba-NLP/gte-Qwen2-1.5B-instruct
- **Parameters:** 1.5 billion
- **Embedding Dimension:** 1536
- **Max Input Tokens:** 32,000
- **Languages:** Multilingual (optimized for English)
- **License:** MIT
- **Tokenizer:** Custom telecom-specific tokenizer (open source)

## Intended Uses

- Semantic search over telecom documents (3GPP standards, vendor manuals)
- Fault log analysis for root-cause detection
- Telecom-specific chatbots and Q&A systems
- Regulatory compliance analysis and semantic auditing
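
As a concrete illustration of the semantic-search use case, the sketch below ranks a few telecom snippets against a query by cosine similarity. It assumes the `netoai/t-vec` checkpoint loads through the standard `transformers` API (as in the Usage section below) and that masked mean pooling is an acceptable way to derive sentence embeddings; both are assumptions rather than documented requirements.

```python
from transformers import AutoModel, AutoTokenizer
import torch
import torch.nn.functional as F

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Masked mean pooling over the last hidden state (an assumed pooling strategy)."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # (batch, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(emb, dim=-1)                   # unit length, so dot product = cosine

corpus = [
    "gNB initiates an Xn-based handover when the A3 event is reported.",
    "The AMF selects an SMF based on the requested DNN and S-NSSAI.",
    "High PRB utilization on cell 42 correlates with RRC connection drops.",
]
query_emb = embed(["Which core function handles session management selection?"])
scores = query_emb @ embed(corpus).T                  # cosine similarities, shape (1, 3)
best = scores.argmax().item()
print(corpus[best], scores[0, best].item())
```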

## Training Details

- **Objective:** Triplet loss with cosine similarity
- **Dataset:** 100k+ telecom triplets curated by domain experts over 3 months
- **Layer Modification:** 338 transformer layers fine-tuned
- **Avg. L2-Norm Weight Change:** 0.7735
- **Enhancements:** Telecom-specific tokenizer and query-aware anchor strategies
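
For intuition about the training objective, here is a minimal sketch of a cosine-similarity triplet loss of the kind named above: it pushes the anchor-positive similarity above the anchor-negative similarity by a margin. The margin value and batch setup are illustrative assumptions; the card does not document the exact hyperparameters.

```python
import torch

def cosine_triplet_loss(anchor: torch.Tensor,
                        positive: torch.Tensor,
                        negative: torch.Tensor,
                        margin: float = 0.2) -> torch.Tensor:
    """max(0, margin - cos(a, p) + cos(a, n)), averaged over the batch.

    The 0.2 margin is an illustrative choice, not the documented value.
    """
    sim_pos = torch.nn.functional.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = torch.nn.functional.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(margin - sim_pos + sim_neg, min=0.0).mean()

# Toy check with random embeddings of T-VEC's dimension (1536):
a, p, n = (torch.randn(4, 1536) for _ in range(3))
print(cosine_triplet_loss(a, p, n))
```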

## Evaluation Results

| Benchmark | Metric | Score |
|---|---|---|
| Telecom Triplet Benchmark | Accuracy | 0.9380 |
| MTEB Benchmark | Accuracy | 0.825 |
| STS Benchmark | Spearman Correlation | 82.19 |
| AllNLI Triplet | Accuracy | 0.6150 |
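
The triplet scores above are accuracies: commonly defined as the fraction of (anchor, positive, negative) triplets for which the model places the positive closer to the anchor than the negative. A minimal sketch of that computation, assuming pre-computed embeddings, looks like this:

```python
import torch

def triplet_accuracy(anchor: torch.Tensor,
                     positive: torch.Tensor,
                     negative: torch.Tensor) -> float:
    """Fraction of triplets with cos(anchor, positive) > cos(anchor, negative)."""
    sim_pos = torch.nn.functional.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = torch.nn.functional.cosine_similarity(anchor, negative, dim=-1)
    return (sim_pos > sim_neg).float().mean().item()
```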

T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks, while still retaining competitive general performance.

| Model | ArguAna | SciDocsRR | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark |
|---|---|---|---|---|---|---|---|---|
| gte-Qwen2-1.5B-instruct | 0.62335 | 0.81558 | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379 |
| T-VEC | 0.61150 | 0.83970 | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050 |
| all-MiniLM-L6-v2 | 0.50167 | 0.87119 | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032 |
| all-mpnet-base-v2 | 0.46521 | 0.88654 | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422 |
| bge-base-en-v1.5 | 0.63616 | 0.87494 | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418 |
| e5-base-v2 | 0.51604 | 0.82834 | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480 |
| jina-embeddings-v2-base-en | 0.44152 | 0.83106 | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842 |
| instructor-xl | 0.54884 | 0.79538 | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048 |
| gte-base | 0.57151 | 0.87083 | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738 |
| multilingual-e5-base | 0.47829 | 0.80392 | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201 |


## Limitations

- Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
- Large model size may hinder deployment on edge devices
- May miss recent telecom developments outside the training set

## Ethical Considerations

- Use in critical telecom systems should be validated by domain experts
- May reflect terminology biases from dominant vendors in the dataset
- Open licensing (MIT) supports transparency and community contributions

## Usage

### Installation

```bash
pip install transformers torch
```

### Load and Run

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")
model.eval()

texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)

# Mean-pool the last hidden state, masking out padding tokens so they
# do not dilute the embeddings of shorter texts.
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (batch, seq_len, 1536)
mask = inputs["attention_mask"].unsqueeze(-1)
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity of the first text against the other two.
cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
```
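
Masked mean pooling is used above because a plain `.mean(dim=1)` would average padding-token states into the embedding whenever a batch mixes texts of different lengths. The card does not specify a canonical pooling strategy, so treat this as one reasonable default; a wrapper such as `sentence-transformers` may configure pooling differently.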

## Citation

```bibtex
@article{ethiraj2025tvec,
  title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
  author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
  journal={arXiv preprint arXiv:2504.16460},
  year={2025},
  url={https://arxiv.org/abs/2504.16460}
}
```

## References

- Ethiraj, V., Menon, S., Vijay, D. "T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning." arXiv:2504.16460, 2025.
- Schroff, F., Kalenichenko, D., Philbin, J. "FaceNet: A Unified Embedding for Face Recognition and Clustering." CVPR, 2015.
- Hermans, A., Beyer, L., Leibe, B. "In Defense of the Triplet Loss for Person Re-Identification." arXiv:1703.07737, 2017.
- Reimers, N., Gurevych, I. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP, 2019.
- Gao, T., Yao, X., Chen, D. "SimCSE: Simple Contrastive Learning of Sentence Embeddings." arXiv:2104.08821, 2021.
- Gururangan, S., et al. "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks." ACL, 2020.
- Lee, J., Yoon, W., et al. "BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining." Bioinformatics, 2020.
- Sahu, S. K., Maheshwari, A. "Automatic Extraction of Telecom Network Events from Log Messages." IEEE ICC, 2018.
- Wang, X., Li, Y., Han, J. "Log2Vec: A Deep Embedding Model for Network Log Analysis." IEEE/IFIP DSN, 2021.

## Contact