---
license: apache-2.0
datasets:
  - huuuyeah/meetingbank
language:
  - en
metrics:
  - rouge
base_model:
  - google/bigbird-pegasus-large-bigpatent
pipeline_tag: summarization
library_name: transformers
---

# MeetingScript

A BigBird‑Pegasus model fine‑tuned for meeting transcript summarization on the MeetingBank dataset.

## 📦 Model Files

- Weights & config: `pytorch_model.bin`, `config.json`
- Tokenizer: `tokenizer.json`, `tokenizer_config.json`, `merges.txt`, `special_tokens_map.json`
- Generation defaults: `generation_config.json`
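
A `generation_config.json` along these lines would reproduce the beam‑search settings reported under Evaluation Results below; the exact file shipped with the checkpoint may contain additional fields:

```json
{
  "num_beams": 4,
  "max_length": 600,
  "early_stopping": true
}
```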

🔗 Code: https://github.com/kevin0437/Meeting_scripts


## Model Description

MeetingScript is a sequence‑to‑sequence model based on
google/bigbird-pegasus-large-bigpatent,
fine‑tuned on the MeetingBank corpus of meeting transcripts paired with human‑written summaries.
It takes long meeting transcripts (up to 4096 tokens) and produces concise, coherent summaries.


## Evaluation Results

Evaluated on the held‑out test split of MeetingBank (≈ 600 transcripts), using beam search (4 beams, `max_length=600`):

| Metric     | F1 Score (%) |
|------------|--------------|
| ROUGE‑1    | 51.5556      |
| ROUGE‑2    | 38.5378      |
| ROUGE‑L    | 48.0786      |
| ROUGE‑Lsum | 48.0142      |
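
For intuition about what these numbers measure, here is a minimal sketch of ROUGE‑1 F1 (clipped unigram overlap between a candidate summary and a reference). The scores above were computed with the standard `rouge` metric, which additionally applies stemming and covers ROUGE‑2/L/Lsum; this sketch is illustrative only:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-word matches, clipped to min count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the council approved the budget",
                "council approves the annual budget"))  # → 0.6
```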

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1) Load from the Hub
tokenizer = AutoTokenizer.from_pretrained("Shaelois/MeetingScript")
model = AutoModelForSeq2SeqLM.from_pretrained("Shaelois/MeetingScript")

# Put the model on GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# 2) Summarize a long transcript
transcript = """
    Alice: Good morning everyone, let’s get started…
    Bob: I updated the design mockups…
    … (thousands of words) …
"""
inputs = tokenizer(
    transcript,
    max_length=4096,
    truncation=True,
    return_tensors="pt"
).to(device)  # inputs must live on the same device as the model

summary_ids = model.generate(
    **inputs,
    num_beams=4,
    max_length=150,
    early_stopping=True
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("📝 Summary:", summary)
```

## Training Data

- **Dataset:** MeetingBank
- **Splits:** Train (5000+), Validation (600+), Test (600+)
- **Preprocessing:** Sliding‑window chunking for sequences > 4096 tokens
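
The exact chunking parameters are not published here; a minimal sketch of sliding‑window chunking over token ids, assuming the model's 4096‑token window and a hypothetical 512‑token overlap (stride = 4096 − 512):

```python
def sliding_window_chunks(token_ids, window=4096, stride=3584):
    """Split a long token sequence into overlapping fixed-size windows.

    `window` matches the model's 4096-token input limit; the 512-token
    overlap (stride = window - 512) is an illustrative choice, not the
    value used to train the published checkpoint.
    """
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # last window already covers the end of the sequence
    return chunks

chunks = sliding_window_chunks(list(range(10_000)))
print(len(chunks))  # → 3 windows for a 10,000-token transcript
```

At inference time, each window can be summarized independently and the partial summaries concatenated (or summarized again) to cover transcripts longer than 4096 tokens.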