Model Card for Led_sgd_summarizer_250_sp
Model Description
Overview: Led_sgd_summarizer_250_sp is a fine-tuned version of allenai/led-large-16384 designed for abstractive summarization of long Spanish texts (up to 16,384 tokens). It generates concise summaries (~350 characters) for documents such as governmental petitions and legal correspondence. Built with the Hugging Face Transformers library, it leverages the Longformer Encoder-Decoder (LED) architecture to process extended sequences efficiently.
Intended Use
- Primary Use: Summarizing long Spanish texts, particularly in governmental or legal domains.
- Users: Researchers, developers, and organizations working with Spanish text summarization.
- Out-of-Scope: Real-time applications, non-Spanish languages, or very short texts.
Ethical Considerations
- Bias: May reflect biases in the training dataset. Summaries should be reviewed for fairness.
- Misinformation: Summaries may omit key details. Verify outputs for critical applications.
- Environmental Impact: Training required significant computational resources, mitigated by FP16 and gradient checkpointing.
Model Details
- Model Name: Led_sgd_summarizer_250_sp
- Base Model: allenai/led-large-16384
- Architecture: Sequence-to-Sequence (Seq2Seq) Transformer (Longformer Encoder-Decoder, LED)
- Parameters: ~460M
- Tokenizer: AutoTokenizer from allenai/led-large-16384 (BART-based, fast)
- Framework: PyTorch
- Hardware: NVIDIA A100-SXM4-80GB GPU
- Developed By: excribe.co
- Author: excribe.co
- Date: 2025-04-26
- License: CC-BY-3.0
- Training Details:
- Script: Custom Python script using Hugging Face transformers, datasets, evaluate, and accelerate.
- Hyperparameters (see the configuration sketch after the Training Details list):
- Learning Rate: 5e-6
- Batch Size: 1 (effective batch size 32 with gradient accumulation)
- Gradient Accumulation Steps: 32
- Epochs: 5
- Optimizer: AdamW (weight decay: 0.01)
- Precision: FP16
- Gradient Clipping: max_norm=1.0
- Early Stopping: Patience of 2, based on eval_rougeL
- Generation Parameters:
- Num Beams: 4
- Max Length: 250
- Min Length: 50
- Length Penalty: 1.0
- No Repeat N-gram Size: 3
- Optimization:
- Gradient checkpointing to reduce memory usage.
- Model cache disabled during training to save VRAM.
- Training Metrics:
- Train Loss: 0.8424
- Train Runtime: 14,992.38 seconds (~4 hours, 10 minutes)
- Train Samples per Second: 2.477
- Train Steps per Second: 0.077
- Epochs Completed: 3.2197
- Total FLOPs: 6.849736835009741e+16
- Train Samples: 7,428
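The hyperparameters and optimization settings above map onto the Hugging Face Seq2SeqTrainer API roughly as follows. This is a hedged reconstruction, not the original training script: output_dir is a placeholder, the evaluation/save strategies are assumed, and train_dataset, eval_dataset, and compute_metrics are assumed to be prepared as described in the Training Data section.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
    EarlyStoppingCallback,
)

# Reconstruction of the reported setup, not the original script.
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-large-16384")
tokenizer = AutoTokenizer.from_pretrained("allenai/led-large-16384")
model.config.use_cache = False           # cache disabled during training to save VRAM
model.gradient_checkpointing_enable()    # gradient checkpointing to reduce memory

# Generation defaults reported in the card.
model.generation_config.min_length = 50
model.generation_config.length_penalty = 1.0
model.generation_config.no_repeat_ngram_size = 3

args = Seq2SeqTrainingArguments(
    output_dir="led_sgd_summarizer_250_sp",  # placeholder
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,          # effective batch size 32
    num_train_epochs=5,
    weight_decay=0.01,                       # AdamW is the Trainer default optimizer
    fp16=True,
    max_grad_norm=1.0,
    predict_with_generate=True,
    generation_max_length=250,
    generation_num_beams=4,
    eval_strategy="epoch",                   # "evaluation_strategy" in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,             # 7,428 tokenized records (assumed prepared)
    eval_dataset=eval_dataset,               # 826 tokenized records (assumed prepared)
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,         # ROUGE via evaluate (assumed defined)
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```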
Training Data
- Name: Custom Spanish Summarization Dataset
- Source: Proprietary .parquet file
- Size: 8,254 records
- Columns:
- texto_entrada: Input text (long Spanish texts, e.g., petitions, official correspondence)
- asunto: Target summary (~350 characters)
- Preprocessing (see the cleaning sketch at the end of this section):
- Removed HTML tags using BeautifulSoup.
- Normalized text (extra whitespace removed) using regex.
- Filtered invalid records (empty texts, summaries < 5 chars, texts < 10 chars).
- Split:
- Train: 7,428 records (90%)
- Validation: 826 records (10%)
- Language: Spanish
- Access: Contact excribe.co for inquiries.
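The preprocessing and split described above can be approximated with the sketch below. It is illustrative only, not the exact pipeline: the .parquet path and the random seed are placeholders, while the column names texto_entrada and asunto come from the description above.

```python
import re
from bs4 import BeautifulSoup
from datasets import load_dataset

def clean_text(text: str) -> str:
    """Strip HTML tags and collapse extra whitespace."""
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

# Placeholder path; the actual .parquet file is proprietary.
dataset = load_dataset("parquet", data_files="dataset.parquet")["train"]

# Clean both columns.
dataset = dataset.map(
    lambda ex: {
        "texto_entrada": clean_text(ex["texto_entrada"]),
        "asunto": clean_text(ex["asunto"]),
    }
)

# Drop invalid records: empty/short texts (< 10 chars) and short summaries (< 5 chars).
dataset = dataset.filter(
    lambda ex: len(ex["texto_entrada"]) >= 10 and len(ex["asunto"]) >= 5
)

# 90/10 train/validation split (seed is illustrative).
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]
```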
Model Usage
Installation
Install required dependencies:
```bash
pip install transformers datasets evaluate rouge_score nltk torch accelerate sentencepiece beautifulsoup4
```
Using the Pipeline
For easy inference:
```python
import torch
from transformers import pipeline

# Load the summarization pipeline (GPU if available, otherwise CPU).
summarizer = pipeline(
    "summarization",
    model="excribe-co/Led_sgd_summarizer_250_sp",
    tokenizer="excribe-co/Led_sgd_summarizer_250_sp",
    device=0 if torch.cuda.is_available() else -1
)

text = "Radicador de correo electronico Orfeo *20242200099852* Rad. No. 20242200099852 ..."

summary = summarizer(
    text,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    truncation=True
)
print("Generated Summary:", summary[0]["summary_text"])
```
Manual Inference
For more control:
```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()

text = "Your long text here..."
inputs = tokenizer(
    "summarize: " + text,
    max_length=16384,
    truncation=True,
    return_tensors="pt",
    padding=True
)

# LED expects a global attention mask; give global attention to the first token.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
global_attention_mask = global_attention_mask.to(device)

outputs = model.generate(
    **inputs,
    global_attention_mask=global_attention_mask,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3
)
summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Summary:", summary)
```
Evaluation Results
- Metrics:
- Eval Loss: 1.0227
- Average Generated Length (gen_len): 60.66 tokens
- Eval Runtime: 1,807.29 seconds
- Eval Samples per Second: 0.457
- Eval Steps per Second: 0.229
- Eval Samples: 826
- Note: ROUGE metrics were computed but reported as zero, likely due to an issue in computation or data preprocessing. Generated summaries are coherent, but ROUGE scoring requires further investigation.
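As a sanity check, ROUGE can be recomputed offline on a few validation pairs with the evaluate library. This minimal sketch is independent of the training script's metric code, and the example strings are placeholders.

```python
import evaluate

# Independent ROUGE check on a few validation pairs (placeholder strings).
rouge = evaluate.load("rouge")

predictions = ["Resumen generado por el modelo ..."]
references = ["Resumen de referencia (columna asunto) ..."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```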
Additional Information
Limitations
- Computational Requirements: Requires ~48GB VRAM for training, ~16GB for inference with FP16 (see the half-precision loading sketch after this list).
- Language: Optimized for Spanish; untested on other languages.
- Domain Specificity: Best for governmental and legal documents; may underperform on out-of-domain texts.
- Inference Speed: Slow for very long texts due to model size and sequence length.
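To stay near the ~16GB inference figure, the model can be loaded directly in half precision. A minimal sketch, assuming a CUDA GPU is available:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load weights in FP16 to roughly halve inference VRAM (CUDA assumed).
model = AutoModelForSeq2SeqLM.from_pretrained(
    "excribe-co/Led_sgd_summarizer_250_sp",
    torch_dtype=torch.float16,
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()
```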
Citation
@misc{excribe2025ledsgdsummarizer,
author = {excribe.co},
title = {Led_sgd_summarizer_250_sp: Fine-Tuned LED for Long-Text Summarization in Spanish},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp}
}
Contact
For questions, contact excribe.co or open an issue at https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp.
Acknowledgments
- Built upon allenai/led-large-16384 from Hugging Face.
- Thanks to Hugging Face for the transformers, datasets, and evaluate libraries.
- Training supported by excribe.co.