Model Card for Led_sgd_summarizer_250_sp

Model Description

Overview
Led_sgd_summarizer_250_sp is a fine-tuned version of allenai/led-large-16384 designed for abstractive text summarization of long Spanish texts (up to 16,384 tokens). It generates concise summaries (~350 characters) for documents such as governmental petitions and legal correspondence. Built using the Hugging Face Transformers library, it leverages the Longformer Encoder-Decoder (LED) architecture for efficient processing of extended sequences.

Intended Use

  • Primary Use: Summarizing long Spanish texts, particularly in governmental or legal domains.
  • Users: Researchers, developers, and organizations working with Spanish text summarization.
  • Out-of-Scope: Real-time applications, non-Spanish languages, or very short texts.

Ethical Considerations

  • Bias: May reflect biases in the training dataset. Summaries should be reviewed for fairness.
  • Misinformation: Summaries may omit key details. Verify outputs for critical applications.
  • Environmental Impact: Training required significant computational resources, mitigated by FP16 and gradient checkpointing.

Model Details

  • Model Name: Led_sgd_summarizer_250_sp
  • Base Model: allenai/led-large-16384
  • Architecture: Sequence-to-Sequence (Seq2Seq) Transformer (Longformer Encoder-Decoder, LED)
  • Parameters: ~460M
  • Tokenizer: AutoTokenizer from allenai/led-large-16384 (BART-based, fast)
  • Framework: PyTorch
  • Hardware: NVIDIA A100-SXM4-80GB GPU
  • Developed By: excribe.co
  • Author: excribe.co
  • Date: 2025-04-26
  • License: CC-BY-3.0
  • Training Details:
    • Script: Custom Python script using Hugging Face transformers, datasets, evaluate, and accelerate.
    • Hyperparameters (mirrored in the sketch after this list):
      • Learning Rate: 5e-6
      • Batch Size: 1 (effective batch size 32 with gradient accumulation)
      • Gradient Accumulation Steps: 32
      • Epochs: 5
      • Optimizer: AdamW (weight decay: 0.01)
      • Precision: FP16
      • Gradient Clipping: max_norm=1.0
      • Early Stopping: Patience of 2, based on eval_rougeL
      • Generation Parameters:
        • Num Beams: 4
        • Max Length: 250
        • Min Length: 50
        • Length Penalty: 1.0
        • No Repeat N-gram Size: 3
    • Optimization:
      • Gradient checkpointing to reduce memory usage.
      • Model cache disabled during training to save VRAM.
    • Training Metrics:
      • Train Loss: 0.8424
      • Train Runtime: 14,992.38 seconds (~4 hours, 10 minutes)
      • Train Samples per Second: 2.477
      • Train Steps per Second: 0.077
      • Epochs Completed: 3.2197 (of 5 configured; training stopped early)
      • Total FLOPs: 6.849736835009741e+16
      • Train Samples: 7,428
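
For reference, a minimal sketch of Seq2SeqTrainingArguments and the early-stopping setup that mirrors the hyperparameters above. The actual training script is custom and not published, so the output directory, evaluation/save cadence, and any option not listed above are assumptions.

from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

# Sketch only: values taken from the hyperparameter list above; anything
# marked "assumed" is not confirmed by the original training script.
training_args = Seq2SeqTrainingArguments(
    output_dir="./led_sgd_summarizer_250_sp",  # assumed path
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,              # assumed
    gradient_accumulation_steps=32,            # effective batch size 32
    num_train_epochs=5,
    weight_decay=0.01,                         # AdamW weight decay
    fp16=True,
    max_grad_norm=1.0,                         # gradient clipping
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=250,
    generation_num_beams=4,
    eval_strategy="epoch",                     # assumed cadence (evaluation_strategy on older transformers)
    save_strategy="epoch",                     # assumed cadence
    load_best_model_at_end=True,               # required for early stopping
    metric_for_best_model="eval_rougeL",
    greater_is_better=True,
)

# Early stopping with patience 2 on eval_rougeL, passed to Seq2SeqTrainer:
#   Seq2SeqTrainer(..., args=training_args,
#                  callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])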

Training Data

  • Name: Custom Spanish Summarization Dataset
  • Source: Proprietary .parquet file
  • Size: 8,254 records
  • Columns:
    • texto_entrada: Input text (long Spanish texts, e.g., petitions, official correspondence)
    • asunto: Target summary (~350 characters)
  • Preprocessing (sketched in the example after this list):
    • Removed HTML tags using BeautifulSoup.
    • Normalized text (extra whitespaces removed) using regex.
    • Filtered invalid records (empty texts, summaries < 5 chars, texts < 10 chars).
  • Split:
    • Train: 7,428 records (90%)
    • Validation: 826 records (10%)
  • Language: Spanish
  • Access: Contact excribe.co for inquiries.
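
A minimal sketch of the preprocessing steps described above. The dataset itself is proprietary, so the parquet file name, the datasets-based pipeline, and the split seed are assumptions.

import re
from bs4 import BeautifulSoup

def clean_text(text: str) -> str:
    """Strip HTML tags and collapse extra whitespace."""
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def is_valid(record: dict) -> bool:
    """Drop empty texts, texts shorter than 10 chars, and summaries shorter than 5 chars."""
    text = record.get("texto_entrada") or ""
    summary = record.get("asunto") or ""
    return len(text) >= 10 and len(summary) >= 5

# Example usage with the datasets library (file path and seed are assumptions):
# from datasets import load_dataset
# ds = load_dataset("parquet", data_files="sgd_summaries.parquet")["train"]
# ds = ds.map(lambda r: {"texto_entrada": clean_text(r["texto_entrada"]),
#                        "asunto": clean_text(r["asunto"])})
# ds = ds.filter(is_valid)
# splits = ds.train_test_split(test_size=0.1, seed=42)  # 90/10 train/validation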

Model Usage

Installation

Install required dependencies:

pip install transformers datasets evaluate rouge_score nltk torch accelerate sentencepiece beautifulsoup4

Using the Pipeline

For easy inference:

import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="excribe-co/Led_sgd_summarizer_250_sp",
    tokenizer="excribe-co/Led_sgd_summarizer_250_sp",
    device=0 if torch.cuda.is_available() else -1
)

text = "Radicador de correo electronico Orfeo *20242200099852* Rad. No. 20242200099852 ..."
summary = summarizer(
    text,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    truncation=True
)
print("Generated Summary:", summary[0]["summary_text"])

Manual Inference

For more control:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model = AutoModelForSeq2SeqLM.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()

text = "Your long text here..."
inputs = tokenizer(
    "summarize: " + text,
    max_length=16384,
    truncation=True,
    return_tensors="pt",
    padding=True
)

# LED combines local windowed attention with task-specific global attention;
# give the first token global attention so it can attend to the full sequence.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
global_attention_mask = global_attention_mask.to(device)

outputs = model.generate(
    **inputs,
    global_attention_mask=global_attention_mask,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Summary:", summary)

Evaluation Results

  • Metrics:
    • Eval Loss: 1.0227
    • Average Generated Length (gen_len): 60.66 tokens
    • Eval Runtime: 1,807.29 seconds
    • Eval Samples per Second: 0.457
    • Eval Steps per Second: 0.229
    • Eval Samples: 826
  • Note: ROUGE metrics were computed but reported as zero, likely due to an issue in computation or data preprocessing. Generated summaries are coherent, but ROUGE scoring requires further investigation.
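
A minimal sketch for re-computing ROUGE offline with the evaluate library, useful for checking whether the zero scores come from the metric computation rather than the model. The prediction and reference lists below are placeholders.

import evaluate

# Placeholders: replace with generated summaries and the matching
# `asunto` reference summaries from the validation split.
predictions = ["Resumen generado de ejemplo."]
references = ["Resumen de referencia de ejemplo."]

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=predictions,
    references=references,
    use_stemmer=True,  # stemmer choice is an assumption
)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}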

Additional Information

Limitations

  • Computational Requirements: Requires ~48GB VRAM for training, ~16GB for inference with FP16.
  • Language: Optimized for Spanish; untested on other languages.
  • Domain Specificity: Best for governmental and legal documents; may underperform on out-of-domain texts.
  • Inference Speed: Slow for very long texts due to model size and sequence length.

Citation

@misc{excribe2025ledsgdsummarizer,
  author = {excribe.co},
  title = {Led_sgd_summarizer_250_sp: Fine-Tuned LED for Long-Text Summarization in Spanish},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp}
}

Contact

For questions, contact excribe.co or open an issue at https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp.

Acknowledgments

  • Built upon allenai/led-large-16384 from Hugging Face.
  • Thanks to Hugging Face for transformers, datasets, and evaluate libraries.
  • Training supported by excribe.co.