Model Card for Led_sgd_summarizer_250_sp

Model Description

Overview
Led_sgd_summarizer_250_sp is a fine-tuned version of allenai/led-large-16384 designed for abstractive text summarization of long Spanish texts (up to 16,384 tokens). It generates concise summaries (~350 characters) for documents such as governmental petitions and legal correspondence. Built using the Hugging Face Transformers library, it leverages the Longformer Encoder-Decoder (LED) architecture for efficient processing of extended sequences.

Intended Use

  • Primary Use: Summarizing long Spanish texts, particularly in governmental or legal domains.
  • Users: Researchers, developers, and organizations working with Spanish text summarization.
  • Out-of-Scope: Real-time applications, non-Spanish languages, or very short texts.

Ethical Considerations

  • Bias: May reflect biases in the training dataset. Summaries should be reviewed for fairness.
  • Misinformation: Summaries may omit key details. Verify outputs for critical applications.
  • Environmental Impact: Training required significant computational resources, mitigated by FP16 and gradient checkpointing.

Model Details

  • Model Name: Led_sgd_summarizer_250_sp
  • Base Model: allenai/led-large-16384
  • Architecture: Sequence-to-Sequence (Seq2Seq) Transformer (Longformer Encoder-Decoder, LED)
  • Parameters: ~460M
  • Tokenizer: AutoTokenizer from allenai/led-large-16384 (BART-based, fast)
  • Framework: PyTorch
  • Hardware: NVIDIA A100-SXM4-80GB GPU
  • Developed By: excribe.co
  • Author: excribe.co
  • Date: 2025-04-26
  • License: CC-BY-3.0
  • Training Details:
    • Script: Custom Python script using Hugging Face transformers, datasets, evaluate, and accelerate.
    • Hyperparameters (mirrored in the sketch after this list):
      • Learning Rate: 5e-6
      • Batch Size: 1 (effective batch size 32 with gradient accumulation)
      • Gradient Accumulation Steps: 32
      • Epochs: 5
      • Optimizer: AdamW (weight decay: 0.01)
      • Precision: FP16
      • Gradient Clipping: max_norm=1.0
      • Early Stopping: Patience of 2, based on eval_rougeL
      • Generation Parameters:
        • Num Beams: 4
        • Max Length: 250
        • Min Length: 50
        • Length Penalty: 1.0
        • No Repeat N-gram Size: 3
    • Optimization:
      • Gradient checkpointing to reduce memory usage.
      • Model cache disabled during training to save VRAM.
    • Training Metrics:
      • Train Loss: 0.8424
      • Train Runtime: 14,992.38 seconds (~4 hours, 10 minutes)
      • Train Samples per Second: 2.477
      • Train Steps per Second: 0.077
      • Epochs Completed: 3.2197 (of 5 configured; training stopped early)
      • Total FLOPs: 6.849736835009741e+16
      • Train Samples: 7,428
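
For reference, a minimal sketch of Seq2SeqTrainingArguments and the early-stopping setup that mirrors the hyperparameters above. The actual training script is custom and not published, so the output directory, evaluation/save cadence, and any option not listed above are assumptions.

from transformers import Seq2SeqTrainingArguments, EarlyStoppingCallback

# Sketch only: values taken from the hyperparameter list above; anything
# marked "assumed" is not confirmed by the original training script.
training_args = Seq2SeqTrainingArguments(
    output_dir="./led_sgd_summarizer_250_sp",  # assumed path
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,              # assumed
    gradient_accumulation_steps=32,            # effective batch size 32
    num_train_epochs=5,
    weight_decay=0.01,                         # AdamW weight decay
    fp16=True,
    max_grad_norm=1.0,                         # gradient clipping
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=250,
    generation_num_beams=4,
    eval_strategy="epoch",                     # assumed cadence (evaluation_strategy on older transformers)
    save_strategy="epoch",                     # assumed cadence
    load_best_model_at_end=True,               # required for early stopping
    metric_for_best_model="eval_rougeL",
    greater_is_better=True,
)

# Early stopping with patience 2 on eval_rougeL, passed to Seq2SeqTrainer:
#   Seq2SeqTrainer(..., args=training_args,
#                  callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])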

Training Data

  • Name: Custom Spanish Summarization Dataset
  • Source: Proprietary .parquet file
  • Size: 8,254 records
  • Columns:
    • texto_entrada: Input text (long Spanish texts, e.g., petitions, official correspondence)
    • asunto: Target summary (~350 characters)
  • Preprocessing (sketched in the example after this list):
    • Removed HTML tags using BeautifulSoup.
    • Normalized text (extra whitespaces removed) using regex.
    • Filtered invalid records (empty texts, summaries < 5 chars, texts < 10 chars).
  • Split:
    • Train: 7,428 records (90%)
    • Validation: 826 records (10%)
  • Language: Spanish
  • Access: Contact excribe.co for inquiries.
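
A minimal sketch of the preprocessing steps described above. The dataset itself is proprietary, so the parquet file name, the datasets-based pipeline, and the split seed are assumptions.

import re
from bs4 import BeautifulSoup

def clean_text(text: str) -> str:
    """Strip HTML tags and collapse extra whitespace."""
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

def is_valid(record: dict) -> bool:
    """Drop empty texts, texts shorter than 10 chars, and summaries shorter than 5 chars."""
    text = record.get("texto_entrada") or ""
    summary = record.get("asunto") or ""
    return len(text) >= 10 and len(summary) >= 5

# Example usage with the datasets library (file path and seed are assumptions):
# from datasets import load_dataset
# ds = load_dataset("parquet", data_files="sgd_summaries.parquet")["train"]
# ds = ds.map(lambda r: {"texto_entrada": clean_text(r["texto_entrada"]),
#                        "asunto": clean_text(r["asunto"])})
# ds = ds.filter(is_valid)
# splits = ds.train_test_split(test_size=0.1, seed=42)  # 90/10 train/validation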

Model Usage

Installation

Install required dependencies:

pip install transformers datasets evaluate rouge_score nltk torch accelerate sentencepiece beautifulsoup4

Using the Pipeline

For easy inference:

import torch
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="excribe-co/Led_sgd_summarizer_250_sp",
    tokenizer="excribe-co/Led_sgd_summarizer_250_sp",
    device=0 if torch.cuda.is_available() else -1
)

text = "Radicador de correo electronico Orfeo *20242200099852* Rad. No. 20242200099852 ..."
summary = summarizer(
    text,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    truncation=True
)
print("Generated Summary:", summary[0]["summary_text"])

Manual Inference

For more control:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model = AutoModelForSeq2SeqLM.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
tokenizer = AutoTokenizer.from_pretrained("excribe-co/Led_sgd_summarizer_250_sp")
model.eval()

text = "Your long text here..."
inputs = tokenizer(
    "summarize: " + text,
    max_length=16384,
    truncation=True,
    return_tensors="pt",
    padding=True
)

# LED combines local windowed attention with task-specific global attention;
# give the first token global attention so it can attend to the full sequence.
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
global_attention_mask = global_attention_mask.to(device)

outputs = model.generate(
    **inputs,
    global_attention_mask=global_attention_mask,
    max_length=250,
    min_length=50,
    num_beams=4,
    length_penalty=1.0,
    no_repeat_ngram_size=3
)

summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Generated Summary:", summary)

Evaluation Results

  • Metrics:
    • Eval Loss: 1.0227
    • Average Generated Length (gen_len): 60.66 tokens
    • Eval Runtime: 1,807.29 seconds
    • Eval Samples per Second: 0.457
    • Eval Steps per Second: 0.229
    • Eval Samples: 826
  • Note: ROUGE metrics were computed but reported as zero, likely due to an issue in computation or data preprocessing. Generated summaries are coherent, but ROUGE scoring requires further investigation.
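
A minimal sketch for re-computing ROUGE offline with the evaluate library, useful for checking whether the zero scores come from the metric computation rather than the model. The prediction and reference lists below are placeholders.

import evaluate

# Placeholders: replace with generated summaries and the matching
# `asunto` reference summaries from the validation split.
predictions = ["Resumen generado de ejemplo."]
references = ["Resumen de referencia de ejemplo."]

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=predictions,
    references=references,
    use_stemmer=True,  # stemmer choice is an assumption
)
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}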

Additional Information

Limitations

  • Computational Requirements: Requires ~48GB VRAM for training, ~16GB for inference with FP16.
  • Language: Optimized for Spanish; untested on other languages.
  • Domain Specificity: Best for governmental and legal documents; may underperform on out-of-domain texts.
  • Inference Speed: Slow for very long texts due to model size and sequence length.

Citation

@misc{excribe2025ledsgdsummarizer,
  author = {excribe.co},
  title = {Led_sgd_summarizer_250_sp: Fine-Tuned LED for Long-Text Summarization in Spanish},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp}
}

Contact

For questions, contact excribe.co or open an issue at https://huggingface.co/excribe-co/Led_sgd_summarizer_250_sp.

Acknowledgments

  • Built upon allenai/led-large-16384 from Hugging Face.
  • Thanks to Hugging Face for transformers, datasets, and evaluate libraries.
  • Training supported by excribe.co.