---
license: mit
language:
- en
tags:
- text-embeddings
- telecom
- domain-adaptation
- triplet-loss
- transformer
- semantic-search
- sentence-transformers
- domain-specific
- contrastive-learning
- simcse
- bio-bert
- dont-stop-pretraining
metrics:
- name: Telecom Triplet Score
  type: accuracy
  value: 0.938
  verified: false
- name: Average MTEB Score
  type: accuracy
  value: 0.825
  verified: false
- name: Average STS Score
  type: spearman
  value: 82.19
  verified: false
- name: AllNLI Triplet Score
  type: accuracy
  value: 0.615
  verified: false
base_model:
- Alibaba-NLP/gte-Qwen2-1.5B-instruct
model-index:
- name: T-VEC
  results:
  - task:
      type: text-embedding
      name: Telecom Triplet Benchmark
    dataset:
      type: custom
      name: Telecom Triplet Benchmark
    metrics:
    - name: Telecom Triplet Score
      type: accuracy
      value: 0.938
      verified: false
  - task:
      type: text-embedding
      name: MTEB Benchmark
    dataset:
      type: mteb
      name: MTEB Benchmark
    metrics:
    - name: Average MTEB Score
      type: accuracy
      value: 0.825
      verified: false
  - task:
      type: text-embedding
      name: STS Benchmark
    dataset:
      type: mteb/stsbenchmark-sts
      name: STS Benchmark
    metrics:
    - name: Average STS Score
      type: spearman
      value: 82.19
      verified: false
  - task:
      type: text-embedding
      name: AllNLI Triplet
    dataset:
      type: sentence-transformers/all-nli
      name: AllNLI Triplet
    metrics:
    - name: AllNLI Triplet Score
      type: accuracy
      value: 0.615
      verified: false
extra_gated_prompt: Please answer the questions below to gain access to the model
extra_gated_fields:
  Company: text
  Full Name: text
  Email: text
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - Commercial
      - label: Other
        value: other
---
# T-VEC: A Telecom-Specific Text Embedding Model

## Overview
T-VEC (Telecom Vectorization Model) is a domain-adapted text embedding model developed by NetoAI and fine-tuned from Alibaba-NLP/gte-Qwen2-1.5B-instruct. Through deep triplet-loss fine-tuning, T-VEC learns rich semantic representations tailored to telecom use cases, achieving strong results on a custom telecom benchmark while remaining competitive on standard embedding benchmarks.
## Model Details
- Model Name: T-VEC
- Developer: NetoAI
- Base Model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
- Parameters: 1.5 Billion
- Embedding Dimension: 1536
- Max Input Tokens: 32,000
- Languages: Multilingual (optimized for English)
- License: MIT
- Tokenizer: Custom telecom-specific tokenizer (open-source)
## Intended Uses
- Semantic search over telecom documents (3GPP standards, vendor manuals)
- Fault log analysis for root-cause detection
- Telecom-specific chatbots and Q&A systems
- Regulatory compliance analysis and semantic auditing
## Training Details
- Objective: Triplet loss using cosine similarity (see the sketch after this list)
- Dataset: 100k+ telecom triplets curated by domain experts over 3 months
- Layer Modification: 338 transformer layers fine-tuned
- Avg. L2 Norm Weight Change: 0.7735
- Enhancements: Telecom-specific tokenizer and query-aware anchor strategies
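As a rough illustration of the objective above, here is a minimal PyTorch sketch of a triplet loss on cosine similarity. The margin value and the random tensors standing in for encoded triplets are illustrative assumptions, not the published training recipe.

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Push sim(anchor, positive) above sim(anchor, negative) by at least `margin`.
    The margin value here is illustrative, not the published setting."""
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(margin - (sim_pos - sim_neg), min=0.0).mean()

# Random embeddings standing in for a batch of encoded (anchor, positive, negative) triplets
a, p, n = torch.randn(4, 1536), torch.randn(4, 1536), torch.randn(4, 1536)
print(cosine_triplet_loss(a, p, n))
```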
## Evaluation Results
Benchmark | Metric | Score |
---|---|---|
Telecom Triplet Benchmark | Accuracy | 0.9380 |
MTEB Benchmark | Accuracy | 0.825 |
STS Benchmark | Spearman Correlation | 82.19 |
AllNLI Triplet | Accuracy | 0.6150 |
T-VEC significantly outperforms both its base model and other strong general-purpose models on telecom-specific benchmarks, while still retaining competitive general performance.
Model | ArguAna | SciDocsRR | STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark |
---|---|---|---|---|---|---|---|---|
gte‑Qwen2‑1.5B‑instruct | 0.62335 | 0.81558 | 0.72805 | 0.84699 | 0.78803 | 0.87450 | 0.84938 | 0.85379 |
T‑VEC | 0.61150 | 0.83970 | 0.80320 | 0.88220 | 0.82750 | 0.88260 | 0.84780 | 0.88050 |
all‑MiniLM‑L6‑v2 | 0.50167 | 0.87119 | 0.72369 | 0.80603 | 0.75589 | 0.85390 | 0.78989 | 0.82032 |
all‑mpnet‑base‑v2 | 0.46521 | 0.88654 | 0.72634 | 0.83485 | 0.78000 | 0.85663 | 0.80030 | 0.83422 |
bge‑base‑en‑v1.5 | 0.63616 | 0.87494 | 0.78028 | 0.84184 | 0.82273 | 0.87957 | 0.85474 | 0.86418 |
e5‑base‑v2 | 0.51604 | 0.82834 | 0.73489 | 0.82997 | 0.80446 | 0.88181 | 0.83659 | 0.85480 |
jina‑embeddings‑v2‑base‑en | 0.44152 | 0.83106 | 0.74278 | 0.84177 | 0.78808 | 0.87553 | 0.85347 | 0.84842 |
instructor‑xl | 0.54884 | 0.79538 | 0.74085 | 0.85046 | 0.80318 | 0.88359 | 0.83784 | 0.83048 |
gte‑base | 0.57151 | 0.87083 | 0.75707 | 0.85729 | 0.81510 | 0.88810 | 0.83824 | 0.85738 |
multilingual‑e5‑base | 0.47829 | 0.80392 | 0.77933 | 0.76890 | 0.77535 | 0.88373 | 0.82699 | 0.84201 |
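For reference, the triplet accuracy figures above are conventionally the fraction of triplets for which the anchor is more similar (by cosine similarity) to the positive than to the negative. A minimal sketch over precomputed embedding tensors, assuming this conventional definition rather than the exact evaluation harness used here:

```python
import torch
import torch.nn.functional as F

def triplet_accuracy(anchor_emb, positive_emb, negative_emb):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    sim_pos = F.cosine_similarity(anchor_emb, positive_emb, dim=-1)
    sim_neg = F.cosine_similarity(anchor_emb, negative_emb, dim=-1)
    return (sim_pos > sim_neg).float().mean().item()
```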
## Limitations
- Reduced performance on non-domain tasks (e.g., AllNLI) due to specialization
- Large size may impact deployment on edge devices
- May miss recent telecom developments outside the training set
## Ethical Considerations
- Use in critical telecom systems should be validated by domain experts
- May reflect terminology biases from dominant vendors in the dataset
- Open licensing (MIT) supports transparency and community contributions
## Usage

### Installation
```bash
pip install transformers torch
```
### Load and Run
```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("netoai/t-vec")
tokenizer = AutoTokenizer.from_pretrained("netoai/t-vec")
texts = ["5G NR architecture", "LTE handover", "Core network functions"]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=32000)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

# Mean-pool over non-padding tokens so padding does not skew the embeddings
mask = inputs["attention_mask"].unsqueeze(-1)
emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity of the first text against the other two
cos_sim = torch.nn.functional.cosine_similarity(emb[0:1], emb[1:], dim=1)
print(cos_sim)
```
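Building on the snippet above (and reusing its `model` and `tokenizer`), the sketch below shows one way to run semantic search over a small telecom corpus: embed the documents once, then rank them by cosine similarity to a query. The `embed` helper and the masked mean pooling are illustrative choices, not a prescribed pooling strategy.

```python
import torch
import torch.nn.functional as F

def embed(texts, tokenizer, model):
    """Mean-pool token embeddings over non-padding positions."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Illustrative corpus; real deployments would index 3GPP specs, vendor manuals, fault logs, etc.
corpus = [
    "gNB handover procedures in 5G NR",
    "EPC bearer setup and QoS handling in LTE",
    "SIP registration flow in the IMS core",
]
query_emb = embed(["How does handover work in 5G?"], tokenizer, model)
corpus_emb = embed(corpus, tokenizer, model)
scores = F.cosine_similarity(query_emb, corpus_emb, dim=-1)
for doc, score in sorted(zip(corpus, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")
```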
## Citation
```bibtex
@article{ethiraj2025tvec,
  title={T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning},
  author={Ethiraj, Vignesh and Menon, Sidhanth and Vijay, Divya},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2504.16460}
}
```
## References
- Ethiraj, V., Menon, S., Vijay, D. “T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning.” arXiv:2504.16460, 2025.
- Schroff, F., Kalenichenko, D., Philbin, J. “FaceNet: A Unified Embedding for Face Recognition and Clustering.” CVPR, 2015.
- Hermans, A., Beyer, L., Leibe, B. “In Defense of the Triplet Loss for Person Re-Identification.” arXiv:1703.07737, 2017.
- Reimers, N., Gurevych, I. “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.” EMNLP, 2019.
- Gao, T., Yao, X., Chen, D. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” arXiv:2104.08821, 2021.
- Gururangan, S., et al. “Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks.” ACL, 2020.
- Lee, J., Yoon, W., et al. “BioBERT: a pre-trained biomedical language representation model for biomedical text mining.” Bioinformatics, 2020.
- Sahu, S. K., Maheshwari, A. “Automatic extraction of telecom network events from log messages.” IEEE ICC, 2018.
- Wang, X., Li, Y., Han, J. “Log2Vec: A Deep Embedding Model for Network Log Analysis.” IEEE/IFIP DSN, 2021.
## Contact
- For questions or contributions, visit https://www.netoai.ai.